System and method for recognition and annotation of facial expressions

ABSTRACT

The innovation disclosed and claimed herein, in aspects thereof, comprises systems and methods of identifying AUs and emotion categories in images. The systems and methods utilize a set of images that include facial images of people. The systems and methods analyze the facial images to determine AUs and facial color due to facial blood flow variations that are indicative of an emotion category. In aspects, the analysis can include Gabor transforms to determine the AUs, AU intensities and emotion categories. In other aspects, the systems and methods can include color variance analysis to determine the AUs, AU intensities and emotion categories. In further aspects, the analysis can include deep neural networks that are trained to determine the AUs, emotion categories and their intensities.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to pending PCT Application Serial Number PCT/US17/35502 entitled “SYSTEM AND METHOD FOR RECOGNITION AND ANNOTATION OF FACIAL EXPRESSIONS,” filed Jun. 1, 2017, which claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 62/343,994 entitled “SYSTEM AND METHOD FOR RECOGNITION AND ANNOTATION OF FACIAL EXPRESSIONS” filed Jun. 1, 2016. The entireties of the above-noted applications are incorporated by reference herein.

GOVERNMENT LICENSE RIGHTS

This invention was made with government support under Grant Nos. R01-EY-020834 and R01-DC-014498 awarded by the National Eye Institute and the National Institute on Deafness and Other Communication Disorders; both institutes are part of the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

Basic research in face perception and emotion theory can leverage large annotated databases of images and video sequences of facial expressions of emotion. Some of the most useful and typically needed annotations are Action Units (AUs), AU intensities, and emotion categories. While small and medium size databases can be manually annotated by expert coders over several months, large databases cannot. For example, even if it were possible to annotate each face image very quickly by an expert coder (say, 20 seconds/image), it would take 5,556 hours to code a million images, which translates to 694 (8 hour) working days or 2.66 years of uninterrupted work.

Existing algorithms either do not recognize all the AUs needed for all applications, do not specify AU intensity, are too computationally demanding in space and/or time to work with large databases, or are only tested within specific databases (e.g., even when multiple databases are used, training and testing is generally done within each database independently).

SUMMARY

The present disclosure provides a computer vision and machine learning process for the recognition of action units (AUs), their intensities, and a large number (23) of basic and compound emotion categories across databases. Crucially, the exemplified process is the first to provide reliable recognition of AUs and their intensities across databases and runs in real-time (>30 images/second). These capabilities facilitate the automatic annotation of a large database of a million facial expressions of emotion images “in the wild,” a feat not attained by any other system.

Additionally, images are annotated semantically with 421 emotion keywords.

A computer vision process for the recognition of AUs and AU intensities in images of faces is presented. Among other things, the instant process can reliably recognize AUs and AU intensities across databases. It is also demonstrated herein that the instant process can be trained using several databases to successfully recognize AUs and AU intensities on an independent database of images not used to train our classifiers. In addition, the instant process is used to automatically construct and annotate a large database of images of facial expressions of emotion. Images are annotated with AUs, AU intensities and emotion categories. The result is a database of a million images that can be readily queried by AU, AU intensity, emotion category and/or emotive keyword.

In addition, the instant process facilitates comprehensive computer vision processes for the identification of AUs from color features. To this end, color features can be successfully exploited for the recognition of AUs, yielding results that are superior to those obtained with the previously mentioned system. That is, the functions defining color change as an AU goes from inactive to active or vice-versa are consistent within AUs and differential between them. In addition, the instant process reveals how facial color changes can be exploited to identify the presence of AUs in videos filmed under a large variety of image conditions.

Additionally, facial color is used to determine the emotion of the facial expression. As described above, facial expressions of emotion in humans are produced by contracting one's facial muscles, generally called Action Units (AUs). Yet, the surface of the face is also innervated with a large network of blood vessels. Blood flow variations in these vessels yield visible color changes on the face. For example, anger increases blood flow to the face, resulting in red faces, whereas fear is associated with the drainage of blood from the face, yielding pale faces. These visible facial colors allow for the interpretation of emotion in images of facial expressions even in the absence of facial muscle activation. This color signal is independent from that provided by AUs, allowing our algorithms to detect emotion from AUs and color independently.

In addition, a Global-Local loss function for Deep Neural Networks (DNNs) is presented that can be efficiently used in fine-grained detection of similar object landmark points of interest as well as AUs and emotion categories. The derived local and global loss yields accurate local results without the need to use patch-based approaches and results in fast and desirable convergence. The instant Global-Local loss function may be used for the recognition of AUs and emotion categories.

In some embodiments, the facial recognition and annotation processes are used in clinical applications.

In some embodiments, the facial recognition and annotation processes are used in the detection and evaluation of psychopathologies.

In some embodiments, the facial recognition and annotation processes are used in screening for post-traumatic stress disorder, e.g., in a military setting or an emergency room.

In some embodiments, the facial recognition and annotation processes are used for teaching children with learning disorders (e.g., Autism Spectrum Disorder) to recognize facial expressions.

In some embodiments, the facial recognition and annotation processes are used for advertising, e.g., for analysis of people looking at ads; for analysis of people viewing a movie; for analysis of people's responses in sport arenas.

In some embodiments, the facial recognition and annotation processes are used for surveillance.

In some embodiments, the recognition of emotion, AUs and other annotations is used to improve or identify web searches, e.g., the system is used to identify images of faces expressing surprise or the images of a particular person with furrowed brows.

In some embodiments, the facial recognition and annotation processes are used in retail to monitor, evaluate or determine customer behavior.

In some embodiments, the facial recognition and annotation processes are used to organize electronic pictures of an institution or individual, e.g., to organize the personal pictures of a person by emotion or AUs.

In some embodiments, the facial recognition and annotation processes are used to monitor patients' emotions, pain and mental state in a hospital or clinical setting, e.g., to determine the level of discomfort of a patient.

In some embodiments, the facial recognition and annotation processes are used to monitor a driver's behavior and attention to the road and other vehicles.

In some embodiments, the facial recognition and annotation processes are used to automatically select emoji, stickers or other texting affective components.

In some embodiments, the facial recognition and annotation processes are used to improve online surveys, e.g., to monitor emotional responses of online survey participants.

In some embodiments, the facial recognition and annotation processes are used in online teaching and tutoring.

In some embodiments, the facial recognition and annotation processes are used to determine the fit of a job applicant for a specific company, e.g., one company may be looking for attentive participants, whereas another may be interested in joyful personalities. In another example, the facial recognition and annotation processes are used to determine the competence of an individual during a job interview or on an online video resume.

In some embodiments, the facial recognition and annotation processes are used in gaming.

In some embodiments, the facial recognition and annotation processes are used to evaluate patients' responses in a psychiatrist's office, clinic or hospital.

In some embodiments, the facial recognition and annotation processes are used to monitor babies and children.

In an aspect, a computer-implemented method is disclosed (e.g., for analyzing an image to determine AU and AU intensity, e.g., in real-time). The method comprises maintaining, in memory (e.g., persistent memory), one or a plurality of kernel vector spaces (e.g., kernel vector space) of configural or other shape features and of shading features, wherein each kernel space is associated with one or several action units (AUs) and/or an AU intensity value and/or emotion category; receiving an image (e.g., an image of a facial expression, externally or from one or multiple databases) to be analyzed; and for each received image: i) determining face space data (e.g., face vector space) of configural, shape and shading features of a face in the image (e.g., wherein the face space includes a shape feature vector of the configural features and a shading feature vector associated with shading changes in the face); and ii) determining one or more AU values for the image by comparing the determined face space data of configural features to the plurality of kernel spaces to determine the presence of AUs, AU intensities and emotion categories.

In some embodiments, the method includes processing, in real-time, a video stream comprising a plurality of images to determine AU values and AU intensity values for each of the plurality of images.

In some embodiments, the face space data includes a shape feature vector of the configural features and a shading feature vector associated with shading changes in the face.

In some embodiments, the determined face space of the configural, shape and shading features comprises i) distance values (e.g., Euclidean distances) between normalized landmarks in Delaunay triangles formed from the image and ii) distances, areas and angles defined by each of the Delaunay triangles corresponding to the normalized facial landmarks.

In some embodiments, the shading feature vector associated with shading changes in the face is determined by applying Gabor filters to normalized landmark points determined from the face (e.g., to model shading changes due to the local deformation of the skin).

In some embodiments, the shape feature vector of the configural features comprises landmark points derived using a deep neural network (e.g., a convolutional neural network, DNN) comprising a global-local (GL) loss function configured to backpropagate both the local and global fit of landmark points and/or AUs and/or emotion categories projected over the image.

In some embodiments, the method includes, for each received image: i) determining a face space associated with color features of the face; ii) determining one or more AU values for the image by comparing this determined color face space to the plurality of color or kernel vector spaces; and iii) modifying the color of the image such that the face appears to express a specific emotion or have one or more AUs active or at specific intensities.

In some embodiments, the AU value and AU intensity value, collectively, define an emotion and an emotion intensity.

In some embodiments, the image comprises a photograph.

In some embodiments, the image comprises a frame of a video sequence.

In some embodiments, the image comprises an entire video sequence.

In some embodiments, the method includes receiving an image of a facial expression in the wild (e.g., from the Internet); and processing the received image to determine an AU value, an AU intensity value and an emotion category for a face in the received image.

In some embodiments, the method includes receiving a first plurality of images from a first database; receiving a second plurality of images from a second database; and processing the received first plurality and second plurality of images to determine for each image thereof an AU value and an AU intensity value for a face in each respective image, wherein the first plurality of images has a first captured configuration and the second plurality of images has a second captured configuration, wherein the first captured configuration is different from the second captured configuration (e.g., wherein captured configuration includes lighting scheme and magnitude; image background; focal plane; capture resolution; storage compression level; pan, tilt, and yaw of the capture relative to the face, etc.).

In another aspect, a computer-implemented method is disclosed (e.g., for analyzing an image to determine AU, AU intensity and emotion category using color variation in the image). The method includes identifying changes defining the transition of an AU from inactive to active, wherein the changes are selected from the group consisting of chromaticity, hue and saturation, and luminance; and applying a Gabor transform to the identified chromaticity changes (e.g., to gain invariance to the timing of this change during a facial expression).

In another aspect, a computer-implemented method is disclosed for analyzing an image to determine AU and AU intensity. The method includes maintaining, in memory (e.g., persistent memory), a plurality of color features data associated with an AU and/or AU intensity; receiving an image to be analyzed; and for each received image: i) determining configural color features of a face in the image; and ii) determining one or more AU values for the image by comparing the determined configural color features to the plurality of trained color feature data to determine the presence of the determined configural color features in one or more of the plurality of trained color feature data.

In another aspect, a computer-implemented method is disclosed (e.g., for generating a repository of a plurality of face space data each associated with an AU value and an AU intensity value, wherein the repository is used for the classification of face data in an image or video frame for AU and AU intensity). The method includes analyzing a plurality of faces in images or video frames to determine kernel space data for a plurality of AU values and AU intensity values, wherein each kernel space data is associated with a single AU value and a single AU intensity value, and wherein each kernel space is linearly or non-linearly separable from the other kernel face spaces.

In some embodiments, the step of analyzing the plurality of faces to determine kernel space data comprises: generating a plurality of training sets of AUs for a pre-defined number of AU intensity values; and performing kernel subclass discriminant analysis to determine a plurality of kernel spaces, each of the plurality of kernel spaces corresponding to kernel vector data associated with a given AU value, AU intensity value, an emotion category and the intensity of that emotion.

In some embodiments, the kernel space includes functional color space feature data of an image or a video sequence.

In some embodiments, the functional color space is determined by performing a discriminant functional learning analysis (e.g., using a maximum-margin functional classifier) on color images each derived from a given image of a plurality of images.

In another aspect, a non-transitory computer readable medium is disclosed. The computer readable medium has instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to perform any of the methods described above.

In another aspect, a system is disclosed. The system comprises a processor and a computer readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to perform any of the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating the output of a computer vision process to automatically annotate emotion categories and AUs in face images in the wild.

FIG. 2, comprising FIGS. 2A and 2B, is an illustration of detected face landmarks and a Delaunay triangulation of an image.

FIG. 3 is a diagram showing a hypothetical model in which sample images with active AUs are divided into subclasses.

FIG. 4 illustrates an example component diagram of a system using Gabor transforms to determine AUs and emotion categories.

FIG. 5 illustrates a color variance system for detecting AUs using color features in video and/or still images.

FIG. 6 illustrates a color variance system for detecting AUs using color features in video and/or still images.

FIG. 7 illustrates a network system for detecting AUs using deep neural networks in video and/or still images.

FIG. 8 shows an example computer system.

DETAILED DESCRIPTION

Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild

FIG. 1 depicts a resulting database of facial expressions that can be readily queried (e.g., sorted, organized, and/or the like) by AU, AU intensity, emotion category, or emotion/affect keyword. The database facilitates the design of new computer vision algorithms as well as basic, translational and clinical studies in social and cognitive psychology, social and cognitive neuroscience, neuro-marketing, psychiatry, and/or the like.

The database is compiled from the outputs of a computer vision system that automatically annotates emotion category and AU in face images in the wild (i.e., images not already curated in an existing database). The images can be downloaded using a variety of web search engines by selecting only images with faces and with associated emotion keywords in WordNet or another dictionary. FIG. 1 shows three example queries of the database. The top example shows the results of two queries obtained when retrieving all images that have been identified as happy and fearful. Also shown is the number of images in the database of images in the wild that were annotated as either happy or fearful. The third depicted query shows the results of retrieving all images with AU 4 or 6 present, and images with the emotive keywords “anxiety” and “disapproval.”

AU and Intensity Recognition

In some embodiments, the system for recognizing AUs may process at over 30 images/second and is determined to be highly accurate across databases. The system achieves high recognition accuracies across databases and can run in real time. The system can facilitate categorizing facial expressions within one of twenty-three basic and/or compound emotion categories. Categorization of emotion is given by the detected AU pattern of activation. In some embodiments, an image may not belong to one of the 23 categories. When this is the case, the image is annotated with AUs without an emotion category. If an image does not have any AU active, the image is classified as a neutral expression. In addition to determining emotions and emotion intensities in faces, the exemplified processes may be used to identify the “not face” in an image.

Face Space for AU and Intensity Recognition

The system starts by defining a feature space employed to represent AUs in face images. Perception of faces, and facial expressions in particular, by humans is known to involve a combination of shape and shading analyses. The system can define shape features that facilitate perception of facial expressions of emotion. The shape features can be second-order statistics of facial landmarks (i.e., distances and angles between landmark points in the face image). The features can alternatively be called configural features, because the features define the configuration of the face. It is appreciated that the terms may be used interchangeably in this application.

FIG. 2(a) shows examples of normalized face landmarks ŝ_(ij) (j=1, . . . , 66) used by the proposed algorithm. Several (e.g., fifteen) of the landmarks can correspond to anatomical landmarks (e.g., corners of the eyes, mouth, eyebrows, tip of the nose, and chin). Other landmarks can be pseudo-landmarks defining the edge of the eyelids, mouth, brows, lips and jaw line as well as the midline of the nose going from the tip of the nose to the horizontal line given by the center of the two eyes. The number of pseudo-landmarks defining the contour of each facial component (e.g., brows) is constant, which provides equivalency of landmark position for different faces or people.

FIG. 2(b) shows a Delaunay triangulation performed by the system. In this example, the number of triangles in this configuration is 107. Also shown in the image are the angles of the vector θ_(a)=(θ_(a1), . . . , θ_(aq_a))^(T) (with q_(a)=3), which define the angles of the triangles emanating from the normalized landmark ŝ_(ija).

S_(ij)=(s_(ij1) ^(T), . . . , s_(ijp) ^(T))^(T) can be a vector of landmark points in the j^(th) sample image (j=1, . . . , n_(i)) of AU i, where s_(ijk)∈R² are the 2D image coordinates of the k^(th) landmark, and n_(i) is the number of sample images with AU i present. In some embodiments, the face landmarks can be obtained with computer vision algorithms. For example, the computer vision algorithms can be used to automatically detect any number of landmarks (for example, 66 detected landmarks in a test image) as shown in FIG. 2a, where s_(ij)∈R¹³² when the number of landmark points is 66.

The training images can be normalized to have the same inter-eye distance of τ pixels. Specifically, ŝ_(ij)=c s_(ij), where c=τ/∥l−r∥₂, l and r are the image coordinates of the centers of the left and right eye, ∥.∥₂ defines the 2-norm of a vector, ŝ_(ij)=(ŝ_(ij1) ^(T), . . . , ŝ_(ijp) ^(T))^(T), and τ=300. The location of the center of each eye can be readily computed as the geometric mid-point between the landmarks defining the two corners of the eye.

The shape feature vector of configural features can be defined as

x _(ij)=(d _(ij12) , . . . ,d _(ij p-1 p) ,θ₁ ^(T), . . . ,θ_(p) ^(T))^(T)  (1)

where d_(ijab)=∥ŝ_(ija)−ŝ_(ijb)∥₂ are the Euclidean distances between normalized landmarks, a=1, . . . , p−1, b=a+1, . . . , p, and θ_(a)=(θ_(a1), . . . , θ_(aq_a))^(T) are the angles defined by each of the Delaunay triangles emanating from the normalized landmark ŝ_(ija), with q_(a) the number of Delaunay triangles originating at ŝ_(ija) and Σ_(k=1) ^(q_a) θ_(ak)≤360° (the equality holds for non-boundary landmark points). Since each triangle can be defined by three angles and, in this example, there are 107 triangles, the total number of angles in the shape feature vector is 321. The shape feature vectors are

${x_{ij} \in R^{\frac{p{({p - 1})}}{2} + {3t}}},$

where p is the number of landmarks and t the number of triangles in the Delaunay triangulation. In this example, with p=66 and t=107, the vectors x_(ij)∈R^(2,466).
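To make the construction of Eqn. (1) concrete, the following is a minimal sketch (not the claimed implementation) that assembles the configural feature vector from an array of normalized landmarks; the use of scipy's Delaunay routine and degree-valued angles are assumptions, and the triangle count it returns may differ slightly from the 107 used in the example.

```python
import numpy as np
from scipy.spatial import Delaunay

def shape_features(landmarks):
    """Configural features of Eqn. (1): pairwise landmark distances
    plus the interior angles of every Delaunay triangle."""
    p = landmarks.shape[0]
    # Pairwise Euclidean distances d_ab, a < b: p(p-1)/2 values.
    dists = [np.linalg.norm(landmarks[a] - landmarks[b])
             for a in range(p - 1) for b in range(a + 1, p)]
    # Three interior angles per Delaunay triangle: 3t values.
    angles = []
    for simplex in Delaunay(landmarks).simplices:
        pts = landmarks[simplex]
        for i in range(3):
            u = pts[(i + 1) % 3] - pts[i]
            v = pts[(i + 2) % 3] - pts[i]
            c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angles.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return np.concatenate([dists, angles])

# With p = 66 landmarks and t = 107 triangles, the vector has
# 66*65/2 + 3*107 = 2,466 entries, matching the dimensionality above.
```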

The system can employ Gabor filters centered at each of the normalized landmark points ŝ_(ijk) to model shading changes due to the local deformation of the skin. When a facial muscle group deforms the skin of the face locally, the reflectance properties of the skin change (e.g., the skin's bidirectional reflectance distribution function is defined as a function of the skin's wrinkles because these change the way light penetrates and travels between the epidermis and the dermis and may also vary hemoglobin levels), as does the foreshortening of the light source as seen from a point on the surface of the skin.

Cells in early human visual cortex can be modelled by the system using Gabor filters. Face perception can use Gabor-like modeling to gain invariance to shading changes such as those seen when expressing emotions. The filter can be defined as

$\begin{matrix}{{g\left( {{\hat{s}}_{ijk};\lambda,\alpha,\phi,\gamma} \right)} = {{\exp\left( {- \frac{s_{1}^{2} + {\gamma^{2}s_{2}^{2}}}{2\sigma^{2}}} \right)}{\cos\left( {{2\pi\frac{s_{1}}{\lambda}} + \phi} \right)}},} & (2)\end{matrix}$

with ŝ_(ijk)=(ŝ_(ijk1), ŝ_(ijk2))^(T), s₁=ŝ_(ijk1) cos α+ŝ_(ijk2) sin α, s₂=−ŝ_(ijk1) sin α+ŝ_(ijk2) cos α, λ the wavelength (i.e., number of cycles/pixel), α the orientation (i.e., the angle of the normal vector of the sinusoidal function), φ the phase (i.e., the offset of the sinusoidal function), γ the (spatial) aspect ratio, and σ the scale of the filter (i.e., the standard deviation of the Gaussian window).

In some embodiments, a Gabor filter bank can be used with o orientations, s spatial scales, and r phases. In an example Gabor filter bank, the following are set: λ={4, 4√{square root over (2)}, 4×2, 4(2√{square root over (2)}), 4(2×2)}={4, 4√{square root over (2)}, 8, 8√{square root over (2)}, 16} and γ=1. These values are appropriate to represent facial expressions of emotion. The values of o, s, and r are learned using cross-validation on the training set. Possible values are o={4, 6, 8, 10}, σ={λ/4, λ/2, 3λ/4, λ} and φ={0, 1, 2}; 5-fold cross-validation on the training set determines which set of parameters best discriminates each AU in the face space.
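As an illustration, a hedged sketch of the Gabor kernel of Eqn. (2) and the parameter grid described above follows; the square sampling window, the default σ=λ/2, and the particular o and r shown are assumptions, since the text selects o, s and r by cross-validation.

```python
import numpy as np

def gabor_kernel(lam, alpha, phi, gamma=1.0, sigma=None, half=15):
    """Sample the Gabor function of Eqn. (2) on a (2*half+1)^2 grid
    centered at a landmark; sigma defaults to lambda/2 (assumption)."""
    sigma = lam / 2.0 if sigma is None else sigma
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    s1 = x * np.cos(alpha) + y * np.sin(alpha)
    s2 = -x * np.sin(alpha) + y * np.cos(alpha)
    return (np.exp(-(s1 ** 2 + gamma ** 2 * s2 ** 2) / (2 * sigma ** 2))
            * np.cos(2 * np.pi * s1 / lam + phi))

# Wavelengths from the text; o = 8 orientations and r = 3 phases are
# placeholders for the cross-validated choices.
lams = [4, 4 * np.sqrt(2), 8, 8 * np.sqrt(2), 16]
bank = [gabor_kernel(lam, alpha, phi)
        for lam in lams
        for alpha in np.linspace(0, np.pi, 8, endpoint=False)
        for phi in (0.0, 1.0, 2.0)]
```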

I_(ij) is the j^(th) sample image with AU i present. Define

g _(ijk)=(g(ŝ _(ijk);λ₁,α₁,φ₁,γ)*I _(ij), . . . ,g(ŝ _(ijk);λ₅,α_(o),φ_(r),γ)*I _(ij))^(T),  (3)

as the feature vector of Gabor responses at the k^(th) landmark point, where * defines the convolution of the filter g(.) with the image I_(ij), and λ_(k) is the k^(th) element of the set λ defined above; the same applies to α_(k) and φ_(k), but not to γ, since this is generally 1.

The feature vector of the Gabor responses on all landmark points for the j^(th) sample image with AU i active is defined as:

g _(ij)=(g _(ij1) ^(T) , . . . ,g _(ijp) ^(T))^(T)  (4)

The feature vectors define the shading information of the local patches around the landmarks of the face, and their dimensionality is g_(ij)∈R^(5×p×o×s×r).

The final feature vectors defining the shape and shading changes of AU i in face space are defined as

z _(ij)=(x _(ij) ^(T) ,g _(ij) ^(T))^(T) ,j=1, . . . ,n _(i).  (5)

Classification in Face Space for AU and Intensity Recognition

The system can define the training set of AU i as

D _(i)={(z _(i1) ,y _(i1)), . . . ,(z _(i n_i) ,y _(i n_i)),(z _(i n_i+1) ,y _(i n_i+1)), . . . ,(z _(i n_i+m_i) ,y _(i n_i+m_i))}  (6)

where y_(ij)=1 for j=1, . . . , n_(i), indicating that AU i is present in the image, y_(ij)=0 for j=n_(i)+1, . . . , n_(i)+m_(i), indicating that AU i is not present in the image, and m_(i) is the number of sample images that do not have AU i active.

The training set above is also ordered as follows. The set

D _(i)(a)={(z _(i1) ,y _(i1)), . . . ,(z _(i n_ia) ,y _(i n_ia))}  (7)

includes the n_(ia) samples with AU i active at intensity a (the lowest intensity of activation of an AU), the set

D _(i)(b)={(z _(i n_ia+1) ,y _(i n_ia+1)), . . . ,(z _(i n_ia+n_ib) ,y _(i n_ia+n_ib))}  (8)

includes the n_(ib) samples with AU i active at intensity b (the second smallest intensity), the set

D _(i)(c)={(z _(i n_ia+n_ib+1) ,y _(i n_ia+n_ib+1)), . . . ,(z _(i n_ia+n_ib+n_ic) ,y _(i n_ia+n_ib+n_ic))}  (9)

includes the n_(ic) samples with AU i active at intensity c (the next intensity), and the set

D _(i)(d)={(z _(i n_ia+n_ib+n_ic+1) ,y _(i n_ia+n_ib+n_ic+1)), . . . ,(z _(i n_ia+n_ib+n_ic+n_id) ,y _(i n_ia+n_ib+n_ic+n_id))}  (10)

includes the n_(id) samples with AU i active at intensity d (the highest intensity), with n_(ia)+n_(ib)+n_(ic)+n_(id)=n_(i).

An AU can be active at five intensities, which are labeled a, b, c, d, and e. In some embodiments, there are rare examples with intensity e and, hence, in some embodiments, the four other intensities are sufficient. Otherwise, D_(i)(e) defines the fifth intensity.

The four training sets defined above are subsets of D_(i) and can be represented as different subclasses of the set of images with AU i active. In some embodiments, a subclass-based classifier can be used. In some embodiments, the system utilizes Kernel Subclass Discriminant Analysis (KSDA) to derive the instant processes. KSDA can be used because it can uncover complex non-linear classification boundaries by optimizing the kernel matrix and number of subclasses. KSDA optimizes a class discriminant criterion to separate classes optimally. The criterion is formally given by Q_(i)(ϕ_(i),h_(i1),h_(i2))=Q_(i1)(ϕ_(i),h_(i1),h_(i2))Q_(i2)(ϕ_(i),h_(i1),h_(i2)), with Q_(i1)(ϕ_(i),h_(i1),h_(i2)) responsible for maximizing homoscedasticity. The goal of the kernel map is to find a kernel space F where the data is linearly separable. In some embodiments, the subclasses are linearly separable in F, which is the case when the class distributions share the same variance. Q_(i2)(ϕ_(i),h_(i1),h_(i2)) maximizes the distance between all subclass means (i.e., it is used to find the Bayes classifier with smallest Bayes error).

To see this, recall that the Bayes classification boundary is given at the location of feature space where the probabilities of the two Normal distributions are identical (i.e., p(z|N(μ₁, Σ₁))=p(z|N(μ₂, Σ₂)), where N(μ_(i), Σ_(i)) is a Normal distribution with mean μ_(i) and covariance matrix Σ_(i)). Separating the means of the two Normal distributions moves the location where this equality holds, i.e., the equality p(x|N(μ₁, Σ₁))=p(x|N(μ₂, Σ₂)) is attained at a probability value lower than before and, hence, the Bayes error is reduced.

Thus, the first component of the KSDA criterion presented above is given by

$\begin{matrix}{{Q_{i1}\left( {\varphi_{i},h_{i1},h_{i2}} \right)} = {\frac{1}{h_{i1}h_{i2}}{\sum\limits_{c = 1}^{h_{i1}}{\sum\limits_{d = {h_{i1} + 1}}^{h_{i1} + h_{i2}}\frac{{tr}\left( {\Sigma_{ic}^{\varphi_{i}}\Sigma_{id}^{\varphi_{i}}} \right)}{{{tr}\left( \Sigma_{ic}^{\varphi_{i}\, 2} \right)}{{tr}\left( \Sigma_{id}^{\varphi_{i}\, 2} \right)}}}}}} & (11)\end{matrix}$

where Σ_(il) ^(φ_i) is the subclass covariance matrix (i.e., the covariance matrix of the samples in subclass l) in the kernel space defined by the mapping function φ_(i)(.): R^(e)→F, h_(i1) is the number of subclasses representing AU i present in the image, h_(i2) is the number of subclasses representing AU i not present in the image, and e=3t+p(p−1)/2+5×p×o×s×r is the dimensionality of the feature vectors in the face space defined in the Face Space section above.

The second component of the KSDA criterion is,

$\begin{matrix}{{Q_{i2}\left( {\varphi_{i},h_{i1},h_{i2}} \right)} = {\sum\limits_{c = 1}^{h_{i1}}{\sum\limits_{d = {h_{i1} + 1}}^{h_{i1} + h_{i2}}{p_{ic}p_{id}\left\| {\mu_{ic}^{\varphi_{i}} - \mu_{id}^{\varphi_{i}}} \right\|_{2}^{2}}}},} & (12)\end{matrix}$

where p_(il)=n_(l)/n_(i) is the prior of subclass l in class i (i.e., the class defining AU i), n_(l) is the number of samples in subclass l, and μ_(il) ^(φ_i) is the sample mean of subclass l in class i in the kernel space defined by the mapping function φ_(i)(.).

For example, the system can define the mapping functions ϕ_(i)(.) using the Radial Basis Function (RBF) kernel,

$\begin{matrix}{{k\left( {z_{{ij}_{1}},z_{{ij}_{2}}} \right)} = {\exp\left( {- \frac{\left\| {z_{{ij}_{1}} - z_{{ij}_{2}}} \right\|_{2}^{2}}{v_{i}}} \right)}} & (13)\end{matrix}$

where v_(i) is the variance of the RBF, and j₁,j₂=1, . . . , n_(i)+m_(i). Hence, the instant KSDA-based classifier is given by the solution to

$\begin{matrix}{{v_{i}^{*},h_{i1}^{*},h_{i2}^{*}} = {\arg\max\limits_{v_{i},h_{i1},h_{i2}}{Q_{i}\left( {v_{i},h_{i1},h_{i2}} \right)}}} & (14)\end{matrix}$
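One way to evaluate the criterion of Eqn. (14) for a candidate v_i and subclass partition without mapping samples explicitly into F is the kernel trick: subclass mean distances and covariance traces reduce to block statistics of the RBF kernel matrix. The sketch below is an assumption-laden illustration (e.g., the subclass priors are approximated by subclass fractions of all samples), not the claimed optimizer.

```python
import numpy as np

def rbf(X, Y, v):
    """RBF kernel matrix of Eqn. (13)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / v)

def ksda_criterion(pos_subclasses, neg_subclasses, v):
    """Evaluate Q_i = Q_i1 * Q_i2 (Eqns. 11-12) with kernel values only."""
    def tr_and_dist(A, B):
        Kab = rbf(A, B, v)
        # Doubly centered cross-kernel -> tr(Sigma_A Sigma_B).
        Kc = (Kab - Kab.mean(0, keepdims=True)
              - Kab.mean(1, keepdims=True) + Kab.mean())
        tr_ab = (Kc ** 2).sum() / (len(A) * len(B))
        # Squared distance between kernel-space subclass means.
        mu2 = rbf(A, A, v).mean() - 2 * Kab.mean() + rbf(B, B, v).mean()
        return tr_ab, mu2

    n = float(sum(len(s) for s in pos_subclasses + neg_subclasses))
    Q1 = Q2 = 0.0
    for A in pos_subclasses:
        for B in neg_subclasses:
            tr_ab, mu2 = tr_and_dist(A, B)
            tr_aa, _ = tr_and_dist(A, A)
            tr_bb, _ = tr_and_dist(B, B)
            Q1 += tr_ab / (tr_aa * tr_bb)
            Q2 += (len(A) / n) * (len(B) / n) * mu2
    Q1 /= len(pos_subclasses) * len(neg_subclasses)
    return Q1 * Q2  # grid-search this over v and candidate partitions
```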

FIG. 3 depicts a solution of the above equation yielding a model for AU i. In the hypothetical model shown, the sample images with AU 4 active are first divided into four subclasses, with each subclass including the samples of AU 4 at the same intensity of activation (a–d). Then, the derived KSDA-based approach uses a process to further subdivide each subclass into additional subclasses to find the kernel mapping that intrinsically maps the data into a kernel space where the above Normal distributions can be separated linearly and are as far apart from each other as possible.

To do this, the system divides the training set D_(i) into five subclasses. The first subclass (i.e., l=1) includes the sample feature vectors that correspond to the images with AU i active at intensity a, that is, the set D_(i)(a) defined in S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, 111(15):E1454-E1462, 2014, which is incorporated by reference herein in its entirety. The second subclass (l=2) includes the sample subset D_(i)(b). Similarly, the third and fourth subclasses (l=3, 4) include the sample subsets D_(i)(c) and D_(i)(d), respectively. Finally, the fifth subclass (l=5) includes the sample feature vectors corresponding to the images with AU i not active, i.e.,

D _(i)(not active)={(z _(i n_i+1) ,y _(i n_i+1)), . . . , (z _(i n_i+m_i) ,y _(i n_i+m_i))}.  (15)

Thus, initially, the number of subclasses to define AU i active/inactive is five (i.e., h_(i1)=4 and h_(i2)=1). In some embodiments, this number can be larger: for example, if images at intensity e are considered.

Optimizing Equation 14 may yield additional subclasses. The derived approach optimizes the parameter of the kernel map v_(i) as well as the number of subclasses h_(i1) and h_(i2). In this embodiment, the initial (five) subclasses can be further subdivided into additional subclasses. For example, when no kernel parameter v_(i) can map the non-linearly separable samples in D_(i)(a) into a space where these are linearly separable from the other subsets, D_(i)(a) is further divided into two subsets D_(i)(a)={D_(i)(a₁),D_(i)(a₂)}. This division is simply given by nearest-neighbor clustering. Formally, let the sample z_(ij+1) be the nearest neighbor to z_(ij); then the division of D_(i)(a) is readily given by

D _(i)(a ₁)={(z _(i1) ,y _(i1)), . . . ,(z _(i n_a/2) ,y _(i n_a/2))}

D _(i)(a ₂)={(z _(i n_a/2+1) ,y _(i n_a/2+1)), . . . ,(z _(i n_a) ,y _(i n_a))}  (16)

The same applies to D_(i)(b), D_(i)(c), D_(i)(d), D_(i)(e) and D_(i)(not active). Thus, optimizing Equation 14 can result in multiple subclasses to model the samples of each intensity of activation or non-activation of AU i. For example, if subclass one (l=1) defines the samples in D_(i)(a) and the system divides this into two subclasses (and currently h_(i1)=4), then the first two new subclasses will be used to define the samples in D_(i)(a), with the first subclass (l=1) including the samples in D_(i)(a₁) and the second subclass (l=2) those in D_(i)(a₂) (and h_(i1) will now be 5). Subsequent subclasses will define the samples in D_(i)(b), D_(i)(c), D_(i)(d), D_(i)(e) and D_(i)(not active) as defined above. Thus, the order of the samples as given in D_(i) never changes, with subclasses 1 through h_(i1) defining the sample feature vectors associated with the images with AU i active and subclasses h_(i1)+1 through h_(i1)+h_(i2) those representing the images with AU i not active. This end result is illustrated using a hypothetical example in FIG. 3. A minimal sketch of the nearest-neighbor division of Eqn. (16) is shown below.
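In the sketch, samples are chained so that each is the nearest neighbor of its predecessor, then the ordered set is split in half; the greedy chaining and the choice of starting sample are assumptions.

```python
import numpy as np

def nn_order_and_split(Z):
    """Order samples by nearest-neighbor chaining, then split the
    ordered set into two halves (cf. Eqn. 16)."""
    remaining = list(range(len(Z)))
    order = [remaining.pop(0)]  # start from the first sample (assumption)
    while remaining:
        last = Z[order[-1]]
        d = [np.linalg.norm(Z[r] - last) for r in remaining]
        order.append(remaining.pop(int(np.argmin(d))))
    half = len(order) // 2
    return Z[order[:half]], Z[order[half:]]
```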

In one example, every test image in a set of images I_(test) can be classified. First, a face-space feature representation z_(test) of I_(test) is computed as described in the Face Space section above. Second, the vector is projected into the kernel space, yielding z^(φ) _(test). To determine if this image has AU i active, the system computes the nearest mean,

$\begin{matrix}{{j^{*} = {\arg\min\limits_{j}\left\| {z_{test}^{\varphi_{i}} - \mu_{ij}^{\varphi_{i}}} \right\|_{2}}},\quad{j = 1},\ldots,{h_{i1} + h_{i2}}} & (17)\end{matrix}$

If j*≤h_(i1), then I_(test) is labeled as having AU i active; otherwise, it is not.

The classification result provides intensity recognition. If the samples represented by subclass l are a subset of those in D_(i)(a), then the identified intensity is a. Similarly, if the samples of subclass l are a subset of those in D_(i)(b), D_(i)(c), D_(i)(d) or D_(i)(e), then the intensity of AU i in the test image I_(test) is b, c, d or e, respectively. Of course, if j*>h_(i1), the image does not have AU i present and there is no intensity (or, one could say that the intensity is zero).
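Because the kernel space is never formed explicitly, the distance of Eqn. (17) can also be expanded with the kernel trick: ∥φ(z)−μ_j∥² = k(z,z) − 2·mean_k k(z, x_jk) + mean(K_jj). A hedged sketch follows, assuming the RBF kernel of Eqn. (13) (so k(z,z)=1).

```python
import numpy as np

def classify_au(z_test, subclasses, v, h_i1):
    """Nearest kernel-space subclass mean (Eqn. 17).
    subclasses: list of (n_l, e) sample arrays, active subclasses first."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / v)

    z = z_test[None, :]
    d2 = []
    for X in subclasses:
        # ||phi(z) - mu_j||^2 via the kernel trick; k(z, z) = 1 for RBF.
        d2.append(1.0 - 2.0 * rbf(z, X).mean() + rbf(X, X).mean())
    j_star = int(np.argmin(d2))
    return j_star, j_star < h_i1  # subclass index, AU-active flag
```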

FIG. 4 illustrates an example component diagram of a system 400 to perform the functions described above with respect to FIGS. 1-3. The system 400 includes an image database component 410 having a set of images. The system 400 includes a detector 420 to eliminate non-face images in the image database; the detector 420 creates a subset of the set of images that includes only faces. The system 400 includes a training database 430. The training database 430 is utilized by a classifier component 440 to classify images into an emotion category. The system 400 includes a tagging component 450 that tags the images with at least one AU and an emotion category. The system 400 can store the tagged images in a processed image database 460.

Discriminant Functional Learning of Color Features for the Recognition of Facial Action Units

In another aspect, a system facilitates comprehensive computer vision processes for the identification of AUs using facial color features. Color features can be used to recognize AUs and AU intensities. The functions defining color change as an AU goes from inactive to active or vice-versa are consistent within AUs and differential between them. In addition, the system reveals how facial color changes can be exploited to identify the presence of AUs in videos filmed under a large variety of image conditions and outside any curated image database.

The system receives an i^(th) sample video sequence V_(i)={I_(i1), . . . , I_(ir_i)}, where r_(i) is the number of frames and I_(ik)∈R^(3qw) is the vectorized k^(th) color image of q×w RGB pixels. V_(i) can be described as the sample function ƒ_(i)(t).

The system identifies a set of physical facial landmarks on the face and obtains local face regions using algorithms described herein. The system defines the landmark points in vector form as s_(ik)=(s_(ik1), . . . , s_(ik66)), where i is the sample video index, k the frame number, and s_(ikl)∈R² are the 2D image coordinates of the l^(th) landmark, l=1, . . . , 66. For purposes of explanation, specific example values may be used (e.g., 66 landmarks, 107 image patches) in this description. It is appreciated that the values may vary according to image sets, faces, landmarks, and/or the like.

The system defines a set D_(ik)={d_(i1k), . . . , d_(i107k)} as a set of 107 image patches d_(ijk) obtained with a Delaunay triangulation as described above, where d_(ijk)∈R^(3q_ij) is the vector describing the j^(th) triangular local region of q_(ij) RGB pixels and, as above, i specifies the sample video number (i=1, . . . , n) and k the frame (k=1, . . . , r_(i)).

In some embodiments, the size (i.e., number of pixels, q_(ij)) of these local (triangular) regions not only varies across individuals but also within a video sequence of the same person. This is a result of the movement of the facial landmark points, a necessary process to produce a facial expression. The system defines a feature space that is invariant to the number of pixels in each of these local regions. The system computes statistics on the color of the pixels in each local region as follows.

The system computes the first and second (central) moments of the color of each local region,

$\begin{matrix}{{\mu_{ijk} = {q_{ij}^{- 1}{\sum\limits_{p = 1}^{P}d_{ijkp}}}},\quad{\sigma_{ijk} = \sqrt{q_{ij}^{- 1}{\sum\limits_{p = 1}^{P}\left( {d_{ijkp} - \mu_{ijk}} \right)^{2}}}},} & (18)\end{matrix}$

with d_(ijk)=(d_(ijk1), . . . , d_(ijkP))^(T) and μ_(ijk), σ_(ijk)∈R³. In some embodiments, additional moments are computed.

The color feature vector of each local patch can be defined as

$\begin{matrix}{{x_{ij} = \left( {\mu_{ij1},\ldots,\mu_{ijr_{i}},\sigma_{ij1},\ldots,\sigma_{ijr_{i}}} \right)^{T}},} & (19)\end{matrix}$

where i is the sample video index (V_(i)), j the local patch number and r_(i) the number of frames in this video sequence. This feature representation defines the contribution of color in patch j. In some embodiments, other proven features can be included to increase the richness of the feature representation, for example, responses to filters or shape features.
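A short sketch of Eqn. (19) follows, assuming each patch is supplied as a (q_ij, 3) array of RGB pixels per frame.

```python
import numpy as np

def patch_color_features(patches):
    """patches: one (q_ij, 3) RGB array per frame for a single
    triangular region. Returns x_ij of Eqn. (19)."""
    mus = [p.mean(axis=0) for p in patches]    # first moments, R^3 each
    sigmas = [p.std(axis=0) for p in patches]  # second (central) moments
    return np.concatenate([np.ravel(mus), np.ravel(sigmas)])
```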

Invariant Functional Representation of Color

The system can define the above computed color information as a function invariant to time, i.e., the functional representation is consistent regardless of where in the video sequence an AU becomes active.

Let ƒ(.) be the color function that defines the color variations of a video sequence V, and ƒ_(T)(.) a template function that models the color changes associated with the activation of an AU (i.e., from AU inactive to active). The system determines whether ƒ_(T)(.) is present in ƒ(.).

In some embodiments, the system determines this by placing the template function ƒ_(T)(.) at each possible location in the time domain of ƒ(.). This is typically called a sliding-window approach, because it involves sliding a window left and right until all possible positions of ƒ_(T)(.) have been checked.

In other embodiments, the system derives a method using a Gabor transform. The Gabor transform, which determines the frequency and phase content of a local section of a function, is used to derive an algorithm that finds the match of ƒ_(T)(.) in ƒ(.) without a sliding-window search.

In this embodiment, without loss of generality, ƒ(t) can be a function describing one of the color descriptors, e.g., the mean of the red channel in the j^(th) triangle of video i or the first channel in an opponent color representation. Then, the Gabor transform of this function is

G(t,ƒ)=∫_(−∞) ^(∞)ƒ(τ)g(τ−t)e ^(−2πjƒτ) dτ,  (20)

where g(t) is a concave function and j=√{square root over (−1)}. One possible pulse function may be defined as

$\begin{matrix}{{g(t)} = \left\{ {\begin{matrix}{1,} & {0 \leq t \leq L} \\{0,} & {otherwise}\end{matrix},} \right.} & (21)\end{matrix}$

where L is a fixed time length. Other pulse functions might be used in other embodiments. Using the two equations yields

$\begin{matrix}{{G\left( {t,f} \right)} = {{\int_{t - L}^{t}{{f(\tau)}e^{- 2\pi jf\tau}d\tau}} = {e^{- 2\pi jf{({t - L})}}{\int_{0}^{L}{{f\left( {\tau + t - L} \right)}e^{- 2\pi jf\tau}d\tau}}}}} & (22)\end{matrix}$

as the definition of a functional inner product in the timespan [0, L] and, thus, G(.,.) can be written as

G(t,ƒ)=e ^(−2πjƒ(t-L))⟨ƒ(τ+t−L),e ^(−2πjƒτ)⟩,  (23)

where ⟨., .⟩ is the functional inner product. The Gabor transform above is continuous in time and frequency, in the noise-free case.

To compute the color descriptor of the i^(th) video, ƒ_(i1)(t), all functions are defined in a color space spanned by a set of b basis functions ϕ(t)={ϕ₀(t), . . . , ϕ_(b-1)(t)}, with ƒ_(i1)(t)=Σ_(z=0) ^(b-1)c_(i1z)ϕ_(z)(t) and c_(i1)=(c_(i10), . . . , c_(i1 b-1))^(T) the vector of coefficients. The functional inner product of two color descriptors can be defined as

$\begin{matrix}{{\left\langle {{f_{i_{1}}(t)},{f_{i_{2}}(t)}} \right\rangle = {{\sum\limits_{\forall r,q}{\int_{0}^{L}{c_{i_{1}r}{\phi_{r}(t)}c_{i_{2}q}{\phi_{q}(t)}{dt}}}} = {c_{i_{1}}^{T}{\Phi}c_{i_{2}}}}},} & (24)\end{matrix}$

where Φ is a b×b matrix with elements Φ_(rq)=⟨ϕ_(r)(t), ϕ_(q)(t)⟩.

In some embodiments, the model assumes that statistical color properties change smoothly over time and that their effect in muscle activation has a maximum time span of L seconds. The basis functions that fit this description are the first several components of the real part of the Fourier series, i.e., normalized cosine bases. Other basis functions can be used in other embodiments.

Cosine bases can be defined as ψ_(z)(t)=cos(2πzt), z=0, . . . , b−1. The corresponding normalized bases are defined as

$\begin{matrix}{{{\hat{\psi}}_{z}(t)} = {\frac{\psi_{z}(t)}{\sqrt{\left\langle {{\psi_{z}(t)},{\psi_{z}(t)}} \right\rangle}}.}} & (25)\end{matrix}$

The normalized basis set allows Φ=Id_(b), where Id_(b) denotes the b×b identity matrix, rather than an arbitrary positive definite matrix.

The above derivations with the cosine bases make the frequency space implicitly discrete. The Gabor transform {tilde over (G)}(.,.) of color functions becomes

{tilde over (G)}(t,z)=⟨{tilde over (ƒ)} _(i1)(t),{circumflex over (ψ)}_(z)(t)⟩=c _(i1z) ,z=0, . . . ,b−1,  (26)

where {tilde over (ƒ)}_(i1)(t) is the computed function ƒ_(i1)(t) in the interval [t−L, t] and c_(i1z) is the z^(th) coefficient.
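The following sketch shows how the coefficients of Eqn. (26) can be computed for a sampled color-descriptor window: the window is projected onto the normalized cosine bases of Eqn. (25). The trapezoidal numerical integration is an assumption.

```python
import numpy as np

def cosine_coefficients(f_window, L, b):
    """Project a window of a color descriptor, f(t) on [0, L], onto the
    normalized cosine bases (Eqn. 25) to get c_0..c_{b-1} (Eqn. 26)."""
    t = np.linspace(0.0, L, len(f_window))
    coeffs = []
    for z in range(b):
        psi = np.cos(2 * np.pi * z * t)
        psi_hat = psi / np.sqrt(np.trapz(psi * psi, t))  # Eqn. (25)
        coeffs.append(np.trapz(f_window * psi_hat, t))   # <f, psi_hat>
    return np.array(coeffs)
```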

It is appreciated that the above-derived representation does not include the time domain; however, the time-domain coefficients can be found and utilized where needed.

Functional Classifier of Action Units

The system employs the Gabor transform derived above to define a feature space invariant to the timing and duration of an AU. In the resulting space, the system employs a linear or non-linear classifier. In some embodiments, a KSDA, a Support Vector Machine (SVM) or a deep multilayer neural network (DNN) may be used as the classifier.

Functional Color Space

The system includes functions describing the mean and standard deviation of color information from distinct local patches, which requires the simultaneous modeling of multiple functions, described below.

The system defines a multidimensional function Γ_(i)(t)=(γ_(i) ¹(t), . . . , γ_(i) ^(g)(t))^(T), with each function γ_(i) ^(e)(t) the mean or standard deviation of a color channel in a given patch. Using the basis expansion approach, each γ_(i) ^(e)(t) is defined by a set of coefficients c_(i) ^(e) and, thus, Γ_(i)(t) is given by:

c _(i) ^(T)=[(c _(i) ¹)^(T), . . . ,(c _(i) ^(g))^(T)].  (27)

The inner product for multidimensional functions is redefined using normalized Fourier cosine bases to achieve

$\begin{matrix}{\left\langle {{\Gamma_{i}(t)},{\Gamma_{j}(t)}} \right\rangle = {{\sum\limits_{e = 1}^{g}{\left\langle {{\gamma_{i}^{e}(t)},{\gamma_{j}^{e}(t)}} \right\rangle}} = {{\sum\limits_{e = 1}^{g}{\left( c_{i}^{e} \right)^{T}c_{j}^{e}}} = {c_{i}^{T}{c_{j}.}}}}} & (28)\end{matrix}$

Other bases can be used in other embodiments.

The system uses a training set of video sequences to optimize each classifier. It is important to note that the system is invariant to the length (i.e., number of frames) of a video. Hence, the system does not use alignment or cropping of the videos for recognition.

In some embodiments, the system can be extended to identify AU intensity using the above approach and a multi-class classifier. The system can be trained to detect each AU at each of the five intensities, a, b, c, d, and e, as well as AU inactive (not present). The system can also be trained to identify emotion categories in images of facial expressions using the same approach described above.

In some embodiments, the system can detect AUs and emotion categories in videos. In other embodiments, the system can identify AUs in still images. To identify AUs in still images, the system first learns to compute the functional color features defined above from a single image with regression. In this embodiment, the system regresses a function h(x)=y to map an input image x into the required functional representation of color y.

Support Vector Machines

A training set is defined as {(γ₁(t), y₁), . . . , (γ_(n)(t), y_(n))}, where γ_(i)(t)∈H^(v), H^(v) is a Hilbert space of continuous functions with bounded derivatives up to order v, and y_(i)∈{−1, 1} are their class labels, with +1 indicating that the AU is active and −1 inactive.

When the samples of distinct classes are linearly separable, the function w(t) that maximizes class separability is given by

$\begin{matrix}{{{J\left( {{w(t)},v,\xi} \right)} = {\min\limits_{{w{(t)}},v,\xi}\left\{ {{\frac{1}{2}\left\langle {{w(t)},{w(t)}} \right\rangle} + {C{\sum\limits_{i = 1}^{n}\xi_{i}}}} \right\}}}\quad{{{subject}\mspace{14mu}{to}\mspace{14mu}{y_{i}\left( {\left\langle {{w(t)},{\gamma_{i}(t)}} \right\rangle - v} \right)}} \geq {1 - \xi_{i}}},{\xi_{i} \geq 0},} & (29)\end{matrix}$

where v is the bias and, as above, ⟨γ_(i)(t),γ_(j)(t)⟩=∫γ_(i)(t)γ_(j)(t) dt denotes the functional inner product, ξ=(ξ₁, . . . , ξ_(n))^(T) are the slack variables, and C>0 is a penalty value found using cross-validation.

Modeling Γ_(i) with normalized cosine coefficients and applying (28) transforms (29) into the following criterion:

$\begin{matrix}{{J\left( {w,v,\xi,\alpha} \right)} = {\min\limits_{w,v,\xi,\alpha}{\left\{ {{\frac{1}{2}w^{T}w} + {C{\sum\limits_{i = 1}^{n}\xi_{i}}} - {\sum\limits_{i = 1}^{n}{\alpha_{i}\left( {{y_{i}\left( {{w^{T}c_{i}} - v} \right)} - 1 + \xi_{i}} \right)}} - {\sum\limits_{i = 1}^{n}{\theta_{i}\xi_{i}}}} \right\},}}} & (30)\end{matrix}$

where C>0 is a penalty value found using cross-validation.
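Because Eqn. (28) reduces the functional inner product to c_i^T c_j, the functional SVM of Eqn. (30) can be trained as an ordinary linear SVM on the stacked coefficient vectors. A hedged sketch using scikit-learn follows; the file names and the C grid are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# c_train: n x (g*b) matrix of stacked cosine coefficients (Eqn. 27);
# y_train: +1 (AU active) / -1 (AU inactive). Hypothetical files.
c_train = np.load('color_coefficients.npy')
y_train = np.load('au_labels.npy')

# The penalty C is found by cross-validation, as in the text.
svm = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
svm.fit(c_train, y_train)
```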

The system projects the original color spaces onto the first several (e.g., two) principal components of the data. The principal components are given by Principal Components Analysis (PCA). The resulting p dimensions are labeled φ_(PCA_k), k=1, 2, . . . , p.

Once trained, the system can detect AUs, AU intensities and emotion categories in video in real time or faster than real time. In some embodiments, the system can detect AUs at greater than 30 frames/second/CPU thread.

Deep Network Approach Using Multilayer Perceptron

In some embodiments, the system can include a deep network to identify non-linear classifiers in the color feature space.

The system can train a multilayer perceptron network (MPN) using the coefficients c_(i). This deep neural network is composed of several (e.g., 5) blocks of connected layers with batch normalization and some linear or non-linear functional rectification, e.g., rectified linear units (ReLU). To effectively train the network, the system uses data augmentation by super-sampling the minority class (AU active/AU intensity) or down-sampling the majority class (AU not active); the system may also use class weights and weight decay.

The system trains this neural network using gradient descent. The resulting algorithm works in real time or faster than real time, >30 frames/second/CPU thread.
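A minimal PyTorch sketch of such a network follows; the hidden width, input dimensionality, class weights and optimizer settings are assumptions, not the claimed configuration.

```python
import torch
import torch.nn as nn

def make_mpn(in_dim, hidden=256, n_blocks=5):
    """Five blocks of Linear + BatchNorm + ReLU on the functional
    color coefficients c_i, ending in a two-way AU decision."""
    layers = []
    for i in range(n_blocks):
        layers += [nn.Linear(in_dim if i == 0 else hidden, hidden),
                   nn.BatchNorm1d(hidden),
                   nn.ReLU()]
    layers.append(nn.Linear(hidden, 2))  # AU active / not active
    return nn.Sequential(*layers)

model = make_mpn(in_dim=720)  # input size is an assumption
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 4.0]))  # class weights
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

def train_step(x, y):
    """One gradient-descent step on a mini-batch."""
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()
```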

AU Detection in Still Images

To apply the system to still images, the system specifies the color functions f_(i) of image I_(i). That is, the system defines the mapping h(I_(i))=f_(i), where f_(i) is defined by its coefficients c_(i) ^(T). In some embodiments, the coefficients can be learned from training data using non-linear regression.

The system utilizes a training set of m videos, {V₁, . . . , V_(m)}. As above, V_(i)={I_(i1), . . . , I_(ir_i)}. The system considers every sub-set of consecutive frames of length L (with L≤r_(i)), i.e., W_(i1)={I_(i1), . . . , I_(iL)}, W_(i2)={I_(i2), . . . , I_(i(L+1))}, . . . , W_(i(r_i-L))={I_(i(r_i-L)), . . . , I_(ir_i)}. The system computes the color representations of all W_(ik) as described above. This yields x_(ik)=(x_(i1k), . . . , x_(i107k))^(T) for each W_(ik), k=1, . . . , r_(i)−L. Following (19),

x _(ijk)=(μ_(ij1), . . . ,μ_(ijL),σ_(ij1), . . . ,σ_(ijL))^(T),  (31)

where i and k specify the window W_(ik), and j the patch, j=1, . . . , 107.

The system computes the functional color representations f_(ijk) of each W_(ik) for each of the patches, j=1, . . . , 107. This is done using the approach detailed above to yield f_(ijk)=(c_(ijk1), . . . , c_(ijkQ))^(T), where c_(ijkq) is the q^(th) coefficient of the j^(th) patch in window W_(ik). The training set is then given by the pairs {x_(ijk), ƒ_(ijk)}. The training set is used to regress the function ƒ_(ijk)=h(x_(ijk)). For example, let Î be a test image and {circumflex over (x)}_(j) its color representation in patch j. Regression is used to estimate the mapping from image to functional color representation, as defined above. For example, Kernel Ridge Regression can be used to estimate the q^(th) coefficient of the test image as:

ĉ _(jq) =C ^(T)(K+λId)⁻¹κ({circumflex over (x)} _(j)),  (32)

where {circumflex over (x)}_(j) is the color feature vector of the j^(th) patch, C=(c_(1j1q), . . . , c_(mj(r_m-L)q))^(T) is the vector of coefficients of the j^(th) patch in all training windows, K is the kernel matrix with entries k(x_(ijk), x_(îjk̂)) (i and î=1, . . . , m, k and {circumflex over (k)}=1, . . . , r_(i)−L), and κ({circumflex over (x)}_(j))=(k({circumflex over (x)}_(j), x_(1j1)), . . . , k({circumflex over (x)}_(j), x_(mj(r_m-L))))^(T). The system can use the Radial Basis Function kernel, k(a, b; η)=exp(−η∥a−b∥²). In some embodiments, the parameters η and λ are selected to maximize accuracy and minimize model complexity. This is the same as optimizing the bias-variance tradeoff. The system uses a solution to the bias-variance problem as known in the art.
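Eqn. (32) is the standard kernel ridge regression solution, so an off-the-shelf implementation can stand in for it. A hedged sketch follows; the file names and the η/λ values are placeholders, tuned in practice for the bias-variance tradeoff as described.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# x_train: color feature vectors x_ijk of one patch across all training
# windows; C_train: their functional coefficients (one column per q).
x_train = np.load('patch_color_features.npy')  # hypothetical files
C_train = np.load('patch_coefficients.npy')

# gamma plays the role of eta in the RBF kernel, alpha the role of lambda.
krr = KernelRidge(kernel='rbf', gamma=0.1, alpha=1e-2)
krr.fit(x_train, C_train)  # solves Eqn. (32) for every coefficient q

# For a previously unseen still image, compute its patch color feature
# x_hat and recover the functional representation directly:
# c_hat = krr.predict(x_hat[None, :])
```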

As shown above, the system can use the regressor on previously unseen test images. If Î is a previously unseen test image, its functional representation is readily obtained as ĉ=h({circumflex over (x)}), with ĉ=(c₁₁, . . . , c_(107Q))^(T). This functional color representation can be directly used in the functional classifier derived above.

FIG. 5 illustrates a color variance system 500 for detecting AUs or emotions using color variance in video and/or still images. The system 500 includes an image database component 510 having a set of videos and/or images. The system 500 includes a landmark component 520 that detects landmarks in the image database 510. The landmark component 520 creates a subset of the set of images with defined landmarks. The system 500 includes a statistics component 530 that calculates changes in color in a video sequence or statistics in a still image of a face. From the statistics component 530, AUs or emotions are determined for each video or image in the database component 510 as described above. The system 500 includes a tagging component 540 that tags the images with at least one AU or no AU. The system 500 can store the tagged images in a processed image database 550.

Facial Color Used to Recognize Emotion in Images of Facial Expressions and Edit Images of Faces to Make them Appear to Express a Different Emotion

In the methodology described above, the system used configural, shape, shading and color features to identify AUs. This is because AUs define emotion categories, i.e., a unique combination of AUs specifies a unique emotion category. Nonetheless, facial color also transmits emotion. The face can express emotion information to observers by changing the blood flow in the network of blood vessels closest to the surface of the skin. Consider, for instance, the redness associated with anger or the paleness in fear. These color patterns are caused by variations in blood flow and can occur even in the absence of muscle activation. The instant system detects these color variations, which allows it to identify emotion even in the absence of muscle action (i.e., regardless of whether AUs are present or not in the image).

Areas of the Face.

The system denotes each face color image of p by q pixels as I_(ij)∈R^(p×q×3) and the r landmark points of each of the facial components of the face as s_(ij)=(s_(ij1), . . . , s_(ijr))^(T), with s_(ijk)∈R² the 2-dimensional coordinates of the landmark point on the image. Here, i specifies the subject and j the emotion category. In some embodiments, the system uses r=66. These fiducial points define the contours of the internal and external components of the face, e.g., mouth, nose, eyes, brows, jawline and crest. Delaunay triangulation can be used to create the triangular local areas defined by these facial landmark points. This triangulation yields a number of local areas (e.g., 142 areas when using 66 landmark points). Let this number be a.

The system can define a function D={d₁, . . . , d_(a)} as a set of functions that return the pixels of each of these a local regions, i.e., d_(k)(I_(ij)) is a vector including the l pixels within the k^(th) Delaunay triangle in image I_(ij), i.e., d_(k)(I_(ij))=(d_(ijk1), . . . , d_(ijkl))^(T)∈R^(3l), where d_(ijks)∈R³ defines the values of the three color channels of each pixel.

Color Space.

The above derivations divide each face image into a set of localregions. The system can compute color statistics of each of these localregions in each of the images. Specifically, the system computes firstand second moments (i.e., mean and variance) of the data, defined as

$\mu_{ijk} = {l^{- 1}{\sum\limits_{s = 1}^{l}d_{ijks}}}, \qquad \sigma_{ijk} = {\sqrt{l^{- 1}{\sum\limits_{s = 1}^{l}\left( {d_{ijks} - \mu_{ijk}} \right)^{2}}}}.$

In other embodiments, additional moments of the color of the image are utilized. Every image I_(ij) is now represented using the following feature vector of color statistics, x_(ij)=(μ_(ij1)^(T), σ_(ij1)^(T), . . . , μ_(ij120)^(T), σ_(ij120)^(T))^(T)∈ℝ⁷²⁰.
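By way of illustration only, the following minimal Python sketch computes these per-triangle color statistics and concatenates them into one feature vector. The helper name color_feature_vector and the use of scipy's Delaunay triangulation are assumptions of the sketch, not part of the disclosure.

```python
import numpy as np
from scipy.spatial import Delaunay

def color_feature_vector(image, landmarks):
    """Per-triangle color statistics (mean and std per color channel).

    image:     (p, q, 3) float array, one face image I_ij.
    landmarks: (r, 2) array of (x, y) fiducial points s_ij (r = 66 in the text).
    Returns the concatenated vector (mu_1, sigma_1, ..., mu_a, sigma_a).
    """
    tri = Delaunay(landmarks)
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    pixels = np.column_stack([xs.ravel(), ys.ravel()])
    membership = tri.find_simplex(pixels)      # triangle index per pixel, -1 outside
    colors = image.reshape(-1, 3)
    feats = []
    for k in range(tri.nsimplex):
        inside = colors[membership == k]       # the l pixels of d_k(I_ij)
        if len(inside) == 0:                   # degenerate triangle: skip safely
            inside = np.zeros((1, 3))
        feats.append(inside.mean(axis=0))      # mu_ijk
        feats.append(inside.std(axis=0))       # sigma_ijk
    return np.concatenate(feats)
```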

Using the same modeling, the system defines a color feature vector of each neutral face as x_(in)=(μ_(in1)^(T), σ_(in1)^(T), . . . , μ_(in120)^(T), σ_(in120)^(T))^(T), where n indicates that this feature vector corresponds to a neutral expression, not an emotion category. The average neutral face is x̄_(n)=m⁻¹Σ_(i=1)^(m) x_(in), with m the number of identities in the training set. The color representation of a facial expression of emotion is then given by its deviation from this neutral face, x̂_(ij)=x_(ij)−x̄_(n).

Classification.

The system uses a linear or non-linear classifier to classify emotion categories in the color space defined above. In some embodiments, Linear Discriminant Analysis (LDA) is computed on the above-defined color space. In some embodiments, the color space can be defined by the eigenvectors associated with the non-zero eigenvalues of the matrix Σ_(x)⁻¹S_(B), where Σ_(x)=Σ_(i=1)^(m) Σ_(j=1)¹⁸ (x̂_(ij)−μ)(x̂_(ij)−μ)^(T)+δI is the (regularized) covariance matrix, S_(B)=Σ_(j=1)^(C) (x̄_(j)−μ)(x̄_(j)−μ)^(T) is the between-class scatter matrix, x̄_(j)=m⁻¹Σ_(i=1)^(m) x̂_(ij) are the class means, μ=(18m)⁻¹ Σ_(i=1)^(m) Σ_(j=1)¹⁸ x̂_(ij), I is the identity matrix, δ=0.01 is the regularizing parameter, and C is the number of classes.
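As an illustrative sketch only (not the disclosed implementation), the regularized LDA basis can be computed as follows; the function name lda_color_space and the eigenvalue threshold are assumptions.

```python
import numpy as np

def lda_color_space(X, labels, delta=0.01):
    """Regularized LDA on deviation-from-neutral color features.

    X:      (N, 720) matrix of feature vectors x_hat.
    labels: (N,) emotion-category labels.
    Returns the discriminant basis (columns), most discriminant first.
    """
    mu = X.mean(axis=0)
    Sx = (X - mu).T @ (X - mu) + delta * np.eye(X.shape[1])  # regularized covariance
    Sb = np.zeros_like(Sx)
    for c in np.unique(labels):
        mc = X[labels == c].mean(axis=0)
        Sb += np.outer(mc - mu, mc - mu)                     # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sx, Sb))    # eigvecs of Sx^-1 Sb
    order = np.argsort(-evals.real)
    keep = evals.real[order] > 1e-10                         # non-zero eigenvalues only
    return evecs.real[:, order][:, keep]
```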

In other embodiments, the system can employ Subclass Discriminant Analysis (SDA), Kernel Subclass Discriminant Analysis (KSDA), or Deep Neural Networks.

Multiway Classification

The selected classifier (e.g., LDA) is used to compute the color space (or spaces) of the C emotion categories and neutral. In some embodiments, the system is trained to recognize 23 emotion categories, including basic and compound emotions.

The system divides the available samples into ten different sets S={S₁, . . . , S₁₀}, where each subset S_(t) has the same number of samples. This division is done such that the number of samples in each emotion category (plus neutral) is equal in every subset. The system repeats the following procedure for t=1, . . . , 10: All the subsets except S_(t) are used to compute Σ_(x) and S_(B). The samples in subset S_(t), which were not used to compute the LDA subspace ℑ, are projected onto ℑ. Each of the test sample feature vectors t_(j)∈S_(t) is assigned to the emotion category of the nearest category mean, given by the Euclidean distance, e*_(j)=arg min_(e) (t_(j)−x̄_(e))^(T)(t_(j)−x̄_(e)). The classification accuracy over all the test samples t_(j)∈S_(t) is given by μ_(t-th fold)=n_(t)⁻¹ Σ_(∀t_(j)∈S_(t)) 1_({e*_(j)=y(t_(j))}), where n_(t) is the number of samples in S_(t), y(t_(j)) is the oracle function that returns the true emotion category of sample t_(j), and 1_({e*_(j)=y(t_(j))}) is the zero-one loss, which equals one when e*_(j)=y(t_(j)) and zero otherwise. Thus, S_(t) serves as the testing subset to determine the generalization of the color model. Since t=1, . . . , 10, the system repeats this procedure ten times, each time leaving one of the subsets S_(t) out for testing, and then computes the mean classification accuracy as μ_(10-fold)=0.1 Σ_(t=1)¹⁰ μ_(t-th fold). The standard deviation of the cross-validated classification accuracies is σ_(10-fold)=√(0.1 Σ_(t=1)¹⁰ (μ_(t-th fold)−μ_(10-fold))²). This process allows the system to identify the discriminant color features with the best generalization, i.e., those that apply to images not included in the training set.
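An illustrative sketch of this stratified 10-fold procedure with a nearest-mean classifier follows. Here fit_subspace stands for any subspace learner (e.g., the LDA sketch above); the helper names are hypothetical.

```python
import numpy as np

def ten_fold_accuracy(X, labels, fit_subspace):
    """Stratified 10-fold cross-validation with a nearest-mean classifier."""
    rng = np.random.default_rng(0)
    folds = [[] for _ in range(10)]
    for c in np.unique(labels):                      # equal class counts per fold
        idx = rng.permutation(np.flatnonzero(labels == c))
        for t, chunk in enumerate(np.array_split(idx, 10)):
            folds[t].extend(chunk)
    accs = []
    for t in range(10):
        test = np.array(folds[t])
        train = np.array([i for f in range(10) if f != t for i in folds[f]])
        V = fit_subspace(X[train], labels[train])    # subspace from 9 folds
        Z_tr, Z_te = X[train] @ V, X[test] @ V
        means = {c: Z_tr[labels[train] == c].mean(axis=0) for c in np.unique(labels)}
        cats = np.array(list(means))
        dists = np.stack([np.sum((Z_te - means[c]) ** 2, axis=1) for c in cats])
        pred = cats[np.argmin(dists, axis=0)]        # nearest category mean
        accs.append(np.mean(pred == labels[test]))
    return np.mean(accs), np.std(accs)               # mu_10-fold, sigma_10-fold
```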

In other embodiments, the system uses a 2-way (one-versus-all) classifier.

One-Versus-All Classification:

The system identifies the most discriminant color features of each emotion category by repeating the approach described above C times, each time assigning the samples of one emotion category (i.e., emotion category c) to class 1 (i.e., the emotion under study) and the samples of all other emotion categories to class 2. Formally, S_(c)={∀t_(j)|y(t_(j))=c} and S̄_(c)={∀t_(j)|y(t_(j))≠c}, with c=1, . . . , C.

A linear or non-linear classifier (e.g., KSDA) is used to discriminate the samples in S_(c) from those in S̄_(c).

Ten-fold cross-validation: The system uses the same 10-fold cross-validation procedure and nearest-mean classifier described above.

In some embodiments, to avoid biases due to the sample imbalance in this two-class problem, the system can apply downsampling on S̄_(c). In some cases, the system repeats this procedure a number of times, each time drawing a random sample from S̄_(c) to match the number of samples in S_(c).
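A minimal sketch of one such balanced draw, assuming integer category labels; the helper name is illustrative. It would be called as, e.g., pos, neg = one_vs_all_split(labels, c, np.random.default_rng(0)).

```python
import numpy as np

def one_vs_all_split(labels, c, rng):
    """Balanced one-versus-all split for emotion category c.

    Returns indices of class 1 (category c) and a random downsample of
    class 2 (all other categories) of the same size, as described above.
    """
    pos = np.flatnonzero(labels == c)
    neg = np.flatnonzero(labels != c)
    neg = rng.choice(neg, size=len(pos), replace=False)  # downsample S_bar_c
    return pos, neg
```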

Discriminant Color Model:

When using LDA as the 2-way classifier, Σ_(x)⁻¹S_(B)V=VΛ provides the set of discriminant vectors V=(v₁, . . . , v_(b)), ordered from most discriminant to least discriminant, λ₁>λ₂= . . . =λ_(b)=0, where Λ=diag(λ₁, λ₂, . . . , λ_(b)). The discriminant vector v₁=(v_(1,1), . . . , v_(1,720))^(T) defines the contributions of each color feature in discriminating the emotion category. The system may keep only v₁, since this is the only basis vector associated with a non-zero eigenvalue, λ₁>0. The color model of emotion j is thus given by x̃_(j)=v₁^(T) x̄_(j). Similarly, x̃_(n)=v₁^(T) x̄_(n).

Similar results are obtained when using SDA, KSDA, Deep Networks and other classifiers.

Modifying Image Color to Change the Apparent Emotion Expressed by a Face.

The neutral expressions I_(in) can be modified by the system to appear to express an emotion. These are called modified images Ĩ_(ij), where i specifies the image or individual in the image, and j the emotion category. Ĩ_(ij) corresponds to the modified color feature vectors y_(ij)=x_(in)+α(x̃_(j)−x̃_(n)), with α>1. In some embodiments, to create these images, the system modifies the k-th pixel of the neutral image using the color model of emotion j as follows:

${{\tilde{I}}_{ijk} = {{\left( \frac{I_{ink} - w_{g}}{\varrho_{g}} \right)\left( {\varrho_{g} + {\beta{\tilde{\varrho}}_{g}}} \right)} + \left( {w_{g} + {\beta{\tilde{w}}_{g}}} \right)}},$

where I_(ink) is the k-th pixel of the neutral image I_(in), I_(ink)∈d_(g)(I_(in)), w_(g) and ϱ_(g) are the mean and standard deviation of the color of the pixels in the g-th Delaunay triangle, and w̃_(g) and ϱ̃_(g) are the mean and standard deviation of the color of the pixels in d_(g) as given by the new model y_(ij).
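A sketch of this per-triangle recoloring step, assuming the per-triangle statistics are available (e.g., from the feature code above); the names are illustrative.

```python
import numpy as np

def recolor_triangle(pixels, w_g, rho_g, w_new, rho_new, beta=1.0):
    """Per-pixel recoloring of one Delaunay triangle toward an emotion color model.

    pixels:         (l, 3) neutral-image colors inside triangle g.
    w_g, rho_g:     (3,) mean and std of those pixels.
    w_new, rho_new: (3,) mean and std prescribed by the target model y_ij.
    Implements I~ = ((I - w_g)/rho_g)(rho_g + beta*rho_new) + (w_g + beta*w_new).
    """
    z = (np.asarray(pixels) - w_g) / rho_g           # standardize neutral colors
    return z * (rho_g + beta * rho_new) + (w_g + beta * w_new)
```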

In some embodiments, the system smooths the modified images with an r by r Gaussian filter with variance σ. The smoothing eliminates local shading and shape features, forcing observers to focus on the color of the face and making the emotion category more apparent.

In some embodiments, the system modifies images of facial expressions of emotion to decrease or increase the appearance of the expressed emotion. To decrease the appearance of emotion j, the system can attenuate the color pattern associated with emotion j to obtain the resulting image I_(ij)⁻. The images are computed as above using the associated feature vectors z_(ij)⁻=x_(ij)−β(x̃_(j)−x̃_(n)), i=1, . . . , 184, j=1, . . . , 18, β>1.

To increase the perception of the emotion, the system defines the new color feature vectors as z_(ij)⁺=x_(ij)+α(x̃_(j)−x̃_(n)), i=1, . . . , 184, j=1, . . . , 18, α>1, to obtain the resulting images I_(ij)⁺.

FIG. 6 illustrates a color variance system 600 for detecting AUs or emotions using color variance in video and/or still images. The system 600 includes an image database component 610 having a set of videos and/or images. The system 600 includes a landmark component 620 that detects landmarks in the images of the image database 610; the landmark component 620 creates a subset of the set of images with defined landmarks. The system 600 includes a statistics component 630 that calculates changes in color in a video sequence or color statistics in a still image of a face. From the statistics component 630, AUs or emotions are determined for each video or image in the database component 610 as described above. The system 600 includes a tagging component 640 that tags the images with at least one AU or no AU. The system 600 can store the tagged images in a processed image database 650.

The system 600 includes a modification component 660 that can change perceived emotions in an image. In some embodiments, after the system 600 determines a neutral face in an image, the modification component 660 modifies the color of the image of the neutral face to yield or change the appearance of a determined expression of an emotion or an AU. For example, an image is determined to include a neutral expression. The modification component 660 can alter the color in the image so that the expression is perceived as a predetermined expression such as happy or sad.

In other embodiments, after the system 600 determines an emotion or AU in a face in an image, the modification component 660 modifies the color of the image to increase or decrease the intensity of the emotion or AU to change the perception of the emotion or the AU. For example, an image is determined to include a sad expression. The modification component 660 can alter the color in the image to make the expression be perceived as less or more sad.

Global-Local Fitting in DNNs for Fast and Accurate Detection and Recognition of Facial Landmark Points and Action Units

In another aspect, a Global-Local loss function for Deep Neural Networks (DNNs) is presented that can be efficiently used in fine-grained detection of similar object landmark points of interest (e.g., facial landmark points) as well as fine-grained recognition of object attributes, e.g., AUs. The derived local+global loss yields accurate local results without the need to use patch-based approaches and results in fast and desirable convergence. The instant Global-Local loss function may be used for the recognition of AUs or for detecting faces and facial landmark points necessary for the recognition of AUs and facial expressions.

Global-Local Loss

Derivations of a global-local (GL) loss that can be efficiently used in deep networks for detection and recognition in images are presented below. A system can use this loss to train a DNN to recognize AUs. The system uses a portion of the DNN to detect facial landmark points. These detections are concatenated with the output of the fully connected layer of the other components of the network to detect AUs.

Local Fit

The system defines image samples and corresponding output variables as the set {(I₁, y₁), . . . , (I_(n), y_(n))}, where I_(i)∈ℝ^(l×m) is an l×m-pixel image of a face, y_(i) is the true (desirable) output, and n is the number of samples.

In some embodiments, the output variable y_(i) can take various forms. For example, in the detection of 2D object landmark points in images, y_(i) is a vector of p 2D image coordinates y_(i)=(u_(i1), v_(i1), . . . , u_(ip), v_(ip))^(T), with (u_(ij), v_(ij))^(T) the j-th landmark point. In the recognition of AUs, the output variable corresponds to an indicator vector y_(i)=(y_(i1), . . . , y_(iq))^(T), with y_(ij)=1 if AU j is present in image I_(i) and y_(ij)=−1 when AU j is not present in that image.

The system identifies a vector of mapping functions ƒ(I_(i),w)=(ƒ₁(I_(i), w₁), . . . , ƒ_(r)(I_(i),w_(r)))^(T) that converts the input image I_(i) to an output vector y_(i) of detections or attributes, where w=(w₁, . . . , w_(r))^(T) is the vector of parameters of these mapping functions. Note that r=p and ƒ(.)=(û_(i1), v̂_(i1), . . . , û_(ip), v̂_(ip))^(T) in detection, with ƒ_(j)(I_(i), w_(j))=(û_(ij), v̂_(ij))^(T) the estimates of the 2D image coordinates u_(ij) and v_(ij). Similarly, r=q and ƒ(.)=(ŷ_(i1), . . . , ŷ_(iq))^(T) in the recognition of AUs, where ŷ_(ij) is the estimate of whether AU j is present (1) or not (−1) in image I_(i), and q is the number of AUs.

For a fixed mapping function ƒ(I_(i), w) (e.g., a DNN), the system optimizes w, defined as

$\begin{matrix}{{{\mathcal{J}\left( \overset{\sim}{w} \right)} = {\min\limits_{w}{\mathcal{L}_{local}\left( {{f\left( {I_{i},w} \right)},y_{i}} \right)}}},} & (33)\end{matrix}$

where ℒ_(local)(.) denotes the loss function. A classical solution for this loss function is the L²-loss, defined as,

$\begin{matrix}{{{\mathcal{L}_{local}\left( {{f\left( {I_{i},w} \right)},y_{i}} \right)} = {r^{- 1}{\sum\limits_{j = 1}^{r}\left( {{f_{j}\left( {I_{i},w_{j}} \right)} - y_{ij}} \right)^{2}}}},} & (34)\end{matrix}$

where y_(ij) is the j-th element of y_(i), which is y_(ij)∈ℝ² in the detection of face landmark points and y_(ij)∈{−1, +1} in the recognition of AUs.

Without loss of generality, the system uses ƒ_(i) in lieu of ƒ(I_(i),w) and ƒ_(ij) instead of ƒ_(j)(I_(i),w_(j)). Note that the functions ƒ_(ij) are the same for all i, but may be different for distinct values of j.

The above derivations correspond to a local fit. That is, (33) and (34) attempt to optimize the fit of each one of the outputs independently and then take the average fit over all outputs.

The above derived approach has several solutions, even for a fixed fitting error ℒ(.). For example, the error can be equally distributed across all outputs, ∥ƒ_(ij)−y_(ij)∥₂≈∥ƒ_(ik)−y_(ik)∥₂, ∀j, k, where ∥.∥₂ is the 2-norm of a vector. Or, most of the error may be in one (or a few) of the estimates, defined as ∥ƒ_(ij)−y_(ij)∥₂>>∥ƒ_(ik)−y_(ik)∥₂ and ∥ƒ_(ik)−y_(ik)∥₂≈0, ∀k≠j.

In some embodiments, an additional constraint is added to minimize the function

$\begin{matrix}{\frac{2}{r\left( {r + 1} \right)}{\sum\limits_{1 \leq j < k \leq r}{\left\| {\left( {f_{ij} - y_{ij}} \right) - \left( {f_{ik} - y_{ik}} \right)} \right\|^{\alpha}}}} & (35)\end{matrix}$

with α≥1. The system adds a global criterion that facilitates convergence.

Adding Global Structure

The system defines a set of constraints to add global structure extending global descriptors. The constraint in (34) is local because it measures the fit of each element of y_(i) (i.e., y_(ij)) independently. The same criterion can nonetheless be used to measure the fit of pairs of points, formally defined as

$\begin{matrix}{{{\mathcal{L}_{pairs}\left( {f_{i},y_{i}} \right)} = {\frac{2}{r\left( {r + 1} \right)}{\sum\limits_{1 \leq j < k \leq r}\left( {{g\left( {{h\left( f_{ij} \right)},{h\left( f_{ik} \right)}} \right)} - {g\left( {y_{ij},y_{ik}} \right)}} \right)^{2}}}},} & (36)\end{matrix}$

where g(x,z) is a function that computes the similarity between its two entries, and h(.) scales the (unconstrained) output of the network into the appropriate value range. In landmark detection, h(ƒ_(ij))=ƒ_(ij)∈ℝ² and

g(x,z)=∥x−z∥_(b)  (37)

is the b-norm of x−z (e.g., the 2-norm, g(x, z)=√((x−z)^(T)(x−z))), where x and z are 2D vectors defining the image coordinates of two landmarks.

In AU recognition, h(ƒ_(ij))=sign(ƒ_(ij))∈{−1, +1} and

$\begin{matrix}{{g\left( {x_{ij},x_{ik}} \right)} = \left\{ {\begin{matrix}{1,} & {{{if}\mspace{14mu} x_{ij}} = x_{ik}} \\{0,} & {otherwise}\end{matrix},} \right.} & (38)\end{matrix}$

where sign(.) returns −1 if the input number is negative and +1 if the number is positive or zero. x_(ij) is 1 if AU j is present in image I_(i) and −1 if it is not present in that image. Hence, the function h(.): ℝ→{−1, +1}.
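A minimal sketch of the pairwise terms in (36)-(38), assuming plain numpy arrays; the function names are illustrative.

```python
import numpy as np

def g_detection(x, z, b=2):
    """Pairwise similarity for landmarks: b-norm of the difference, eq. (37)."""
    return np.linalg.norm(x - z, ord=b)

def g_au(x_j, x_k):
    """Pairwise similarity for AUs: 1 if the two labels agree, else 0, eq. (38)."""
    return 1.0 if x_j == x_k else 0.0

def pairwise_loss(f, y, g):
    """L_pairs of eq. (36): squared mismatch of the pairwise similarities."""
    r = len(f)
    total = sum((g(f[j], f[k]) - g(y[j], y[k])) ** 2
                for j in range(r) for k in range(j + 1, r))
    return 2.0 * total / (r * (r + 1))
```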

In some embodiments, the system takes into account the global structure of each pair of elements, i.e., each pair of landmark points in detection and each pair of AUs in recognition. That is, in detection, the system uses the information of the distance between all landmark points and, in recognition, determines whether pairs of AUs co-occur (i.e., whether the two are simultaneously present or absent in the sample image).

In some embodiments, the global criterion can be extended to triplets. Formally,

$\begin{matrix}{{{\mathcal{L}_{trip}\left( {f_{i},y_{i}} \right)} = {\begin{pmatrix}r \\ 3\end{pmatrix}^{- 1}{\sum\limits_{1 \leq j < k < s \leq r}\left\lbrack {{g\left( {{h\left( f_{ij} \right)},{h\left( f_{ik} \right)},{h\left( f_{is} \right)}} \right)} - {g\left( {y_{ij},y_{ik},y_{is}} \right)}} \right\rbrack^{2}}}},} & (39)\end{matrix}$

where g(x, z, u) is now a function that computes the similarity between its three entries.

In detection, this means the system can compute the b-norm, e.g., g(x, z, u)=∥(x−z)+(z−u)∥_(b), and the area of the triangle defined by each triplet is calculated as

$\begin{matrix}{{{g\left( {x,z,u} \right)} = {\frac{1}{2}\left| {\left( {x - z} \right) \times \left( {x - u} \right)} \right|}},} & (40)\end{matrix}$

where the three landmark points are non-collinear.

In some embodiments, the equations can be extended to four or more points. For example, the equation can be extended to convex quadrilaterals as g(x, z, u, v)=½|(x−u)×(z−v)|.

In the most general case, for t landmark points, the system computes the area of the polygon envelope, i.e., a non-self-intersecting polygon contained by the t landmark points {x_(i1), . . . , x_(it)}. The polygon is given as follows.

The system computes a Delaunay triangulation of the facial landmark points. The polygon envelope is obtained by connecting the lines of the set of t landmark points in counter-clockwise order. The ordered set of landmark points is defined as x̃_(i)={x̃_(i1), . . . , x̃_(it)}. The area in x̃_(i) is given by,

$\begin{matrix}{{{g_{a}\left( {{\tilde{x}}_{i}} \right)} = {\frac{1}{2}\left\lbrack {\left( {\sum\limits_{k = 1}^{t - 1}\left( {{{\tilde{x}}_{ik,1}{\tilde{x}}_{i(k+1),2}} - {{\tilde{x}}_{ik,2}{\tilde{x}}_{i(k+1),1}}} \right)} \right) + \left( {{{\tilde{x}}_{it,1}{\tilde{x}}_{i1,2}} - {{\tilde{x}}_{it,2}{\tilde{x}}_{i1,1}}} \right)} \right\rbrack}},} & (41)\end{matrix}$

where the subscript a in g_(a)(.) denotes “area,” and x̃_(ik)=(x̃_(ik,1), x̃_(ik,2))^(T).
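For reference, the shoelace computation of (41) can be written compactly; the helper polygon_area is an illustrative name and assumes counter-clockwise vertex ordering. For example, polygon_area(np.array([[0, 0], [1, 0], [1, 1], [0, 1]])) returns 1.0.

```python
import numpy as np

def polygon_area(pts):
    """Shoelace formula of eq. (41): area of a simple polygon.

    pts: (t, 2) array of vertices ordered counter-clockwise.
    """
    x, y = pts[:, 0], pts[:, 1]
    # Cyclic sum x_k * y_{k+1} - x_{k+1} * y_k, including the wrap-around term.
    return 0.5 * np.abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```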

In some embodiments, the result in the above equation is obtained using Green's theorem, as known in the art, and x̃_(i) can either be the t outputs of the DNN, ƒ̃_(i)={ƒ̃_(i1), . . . , ƒ̃_(it)}, or the true values ỹ_(i)={ỹ_(i1), . . . , ỹ_(it)}.

The system can compute the global b-norm, g_(n)(.), for the general case of t landmark points as,

$\begin{matrix}{{g_{n}\left( {{\tilde{x}}_{i}} \right)} = {\sum\limits_{k = 1}^{t - 1}{\left\| {{\tilde{x}}_{ik} - {\tilde{x}}_{i(k+1)}} \right\|_{b}}},} & (42)\end{matrix}$

The above derivations define the extension of g(.) to three and more points in detection problems. From this, the above can be used to recognize AUs in images.

The system computes the co-occurrence of three or more AUs in image I_(i). Formally, let x̃_(i)={x̃_(i1), . . . , x̃_(it)} be a set of t AUs, with x̃_(ij)∈{−1, +1}, j=1, . . . , t, and

$\begin{matrix}{{g_{AU}\left( {{\tilde{x}}_{i}} \right)} = \left\{ {\begin{matrix}{1,} & {{if}\mspace{14mu}{{\tilde{x}}_{i1}} = {\ldots = {\tilde{x}}_{it}}} \\{0,} & {otherwise}\end{matrix}} \right.,} & (43)\end{matrix}$

GL-Loss

The final global-local (GL) loss function is given by,

$\begin{matrix}{{\mathcal{L}\left( {f_{i},y_{i}} \right)} = {{\alpha_{0}{\mathcal{L}_{local}\left( {f_{i},y_{i}} \right)}} + {\mathcal{L}_{global}\left( {f_{i},y_{i}} \right)}},} & (44)\end{matrix}$

where the global loss ℒ_(global) is defined as

$\begin{matrix}{{{\mathcal{L}_{global}\left( {f_{i},y_{i}} \right)} = {\sum\limits_{t = 1}^{t_{\max}}{\alpha_{t}\left\lbrack {{g\left( {{h\left( {\overset{\sim}{f}}_{ij} \right)},\ldots\mspace{14mu},{h\left( {\overset{\sim}{f}}_{it} \right)}} \right)} - {g\left( {{\overset{\sim}{y}}_{ij},\ldots\mspace{14mu},{\overset{\sim}{y}}_{it}} \right)}} \right\rbrack}}},} & (45)\end{matrix}$

g(.) is either g_(a)(.) or g_(n)(.) (or both) in detection and g_(AU)(.) in recognition, and the α_(t) are normalizing constants learned using cross-validation on a training set.
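An illustrative combination of the local term (34) with global terms over t-tuples, in the spirit of (44)-(45); the exact weighting, squaring, and tuple handling are simplifying assumptions of this sketch, not the disclosed formulation.

```python
import numpy as np
from itertools import combinations

def gl_loss(f, y, alphas, g_global, group_sizes=(3,)):
    """Sketch of the GL loss: alpha_0 * L_local + sum of global mismatches.

    f, y:        (r, 2) arrays of predicted and true landmark coordinates.
    alphas:      (alpha_0, alpha_1, ...) weights, cross-validated per the text.
    g_global:    similarity over a t-tuple of points (e.g., polygon_area above).
    group_sizes: tuple sizes t used in the global term.
    """
    local = np.mean(np.sum((f - y) ** 2, axis=-1))     # L2 local fit, eq. (34)
    glob = 0.0
    for a_t, t in zip(alphas[1:], group_sizes):
        for idx in combinations(range(len(f)), t):     # all t-tuples of outputs
            glob += a_t * (g_global(f[list(idx)]) - g_global(y[list(idx)]))
    return alphas[0] * local + glob
```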

Backpropagation

To optimize the parameters of the DNN, w, the system computes

$\begin{matrix}{\frac{\partial\mathcal{L}}{\partial w} = {{\alpha_{0}\frac{\partial\mathcal{L}_{local}}{\partial w}} + {\frac{\partial\mathcal{L}_{global}}{\partial w}.}}} & (46)\end{matrix}$

The partial derivative of the local loss is given by

$\begin{matrix}{\frac{\partial\mathcal{L}_{local}}{\partial w_{j}} = {\frac{2}{r}\frac{\partial f_{ij}}{\partial w_{j}}{\left( {f_{ij} - y_{ij}} \right).}}} & (47)\end{matrix}$

The definition of the global loss uses a mapping function h(.). In some embodiments, when performing landmark detection, h(ƒ_(ij))=ƒ_(ij) and the partial derivatives of the global loss have the same form as those of the local loss shown in the equation above. In other embodiments, when performing AU recognition, the system uses h(ƒ_(ij))=sign(ƒ_(ij))∈{−1, +1}. This function is not differentiable; however, the system redefines it as h(ƒ_(ij))=ƒ_(ij)/√(ƒ_(ij)²+∈), for a small ∈>0. The partial derivative then becomes ∂h(ƒ_(ij))/∂ƒ_(ij)=∈(ƒ_(ij)²+∈)^(−3/2), which is applied through the chain rule with ∂ƒ_(ij)/∂w_(j).
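A short numerical check of this smooth surrogate and its analytic derivative (illustrative only):

```python
import numpy as np

def smooth_sign(f, eps=1e-3):
    """Differentiable surrogate for sign(f): f / sqrt(f^2 + eps)."""
    return f / np.sqrt(f ** 2 + eps)

def smooth_sign_grad(f, eps=1e-3):
    """Analytic derivative: eps * (f^2 + eps)^(-3/2)."""
    return eps * (f ** 2 + eps) ** (-1.5)

# Finite-difference check of the analytic gradient.
f0, h = 0.37, 1e-6
numeric = (smooth_sign(f0 + h) - smooth_sign(f0 - h)) / (2 * h)
assert abs(numeric - smooth_sign_grad(f0)) < 1e-6
```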

Deep DNN

The system includes a deep neural network for the recognition of AUs. The DNN includes two parts. The first part of the DNN is used to detect a large number of facial landmark points. The landmark points allow the system to compute the GL-loss as described above.

The system can compute normalized landmark points. The system can concatenate these with the output of the first fully connected layer of the second part of the DNN to embed the location information of the landmarks into the DNN used to recognize AUs. This facilitates the detection of local shape changes typically observed in the expression of emotion. This is done in the definition of the GL-loss above.

In some embodiments, the DNN includes multiple layers. In an exemplary embodiment, nine layers are dedicated to the detection of facial landmark points, and other layers are used to recognize AUs in a set of images.

The layers devoted to the detection of facial landmark points are detailed as follows.

Facial Landmark Point Detection

In the exemplary embodiment, the DNN includes three convolutional layers, two max pooling layers and two fully connected layers. The system can apply normalization, dropout, and rectified linear units (ReLU) at the end of each convolutional layer.

The weights in these layers are optimized using backpropagation with the derived GL-loss. The global loss and the backpropagation equations are provided above.

In an example, the system uses this part of the DNN to detect a total of 66 facial landmark points. One advantage of the proposed GL-loss is that it can be efficiently trained on very large datasets. In some embodiments, the system includes a facial landmark detector that employs data augmentation to be invariant to affine transformations and partial occlusions.

The facial landmark detector generates additional images by applying two-dimensional affine transformations to the existing training set, i.e., scale, reflection, translation and rotation. In an exemplary embodiment, the scale can be taken between 0.5 and 2, the rotation between −10° and 10°, and the translation and reflection can be randomly generated. To make the DNN more robust to partial occlusions, the system adds randomized occluding boxes of d×d pixels, with d between 0.2 and 0.4 times the inter-eye distance.
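A sketch of one such augmentation pass under the stated parameter ranges; the helper name is illustrative, and warping the image pixels by the same affine transform (e.g., via scipy.ndimage.affine_transform) is omitted here for brevity.

```python
import numpy as np

def augment(image, landmarks, inter_eye, rng):
    """One random augmentation: 2D affine transform of the landmarks plus an
    occluding box drawn on the image."""
    s = rng.uniform(0.5, 2.0)                         # scale in [0.5, 2]
    theta = np.deg2rad(rng.uniform(-10.0, 10.0))      # rotation in [-10, 10] deg
    flip = rng.choice([-1.0, 1.0])                    # random reflection
    A = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]]) @ np.diag([flip, 1.0])
    pts = landmarks @ A.T + rng.uniform(-10, 10, size=2)   # random translation
    d = int(rng.uniform(0.2, 0.4) * inter_eye)        # occluder side length
    y0 = int(rng.integers(0, max(1, image.shape[0] - d)))
    x0 = int(rng.integers(0, max(1, image.shape[1] - d)))
    occluded = image.copy()
    occluded[y0:y0 + d, x0:x0 + d] = 0                # black occluding box
    return occluded, pts
```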

AU Recognition

The second part of the DNN combines the face appearance features with the landmark locations given by the first part of the DNN. Specifically, in the output of the first fully connected layer of the second part of the DNN, the appearance image features are concatenated with the normalized, automatically detected landmark points.

Formally, let s_(i)=(s_(i1)^(T), . . . , s_(ip)^(T))^(T) be the vector of landmark points in the i-th sample image (i=1, . . . , n), where s_(ik)∈ℝ² are the 2D image coordinates of the k-th landmark, and n is the number of sample images. Thus s_(i)∈ℝ¹³². All images are then normalized to have the same inter-eye distance of τ pixels. That is, ŝ_(i)=c s_(i), where c=τ/∥l−r∥₂, l and r are the image coordinates of the centers of the left and right eye, ∥.∥₂ denotes the 2-norm of a vector, and ŝ_(i)=(ŝ_(i1)^(T), . . . , ŝ_(ip)^(T))^(T). The system can use τ=200.

The system normalizes the landmark points as ŝ′_(ik)=R(ŝ_(ik)−l̂)+l̂, where l̂=cl and R is a rotation matrix that makes the outer corners of the left and right eyes match a horizontal line. The system rescales and shifts the values ŝ′_(i) to move the outer corners of the left and right eyes in an image to the pre-determined positions of (0.5, 0) and (−0.5, 0), respectively.
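A sketch of this normalization, assuming the eye centers and outer corners are available as 2D points; the helper name, the choice of rotation pivot, and the sign convention for (±0.5, 0) are assumptions that may be mirrored depending on the image coordinate system.

```python
import numpy as np

def normalize_landmarks(s, l_eye, r_eye, l_outer, r_outer, tau=200.0):
    """Scale to a fixed inter-eye distance, level the outer eye corners,
    then map them to x = -0.5 and x = +0.5."""
    c = tau / np.linalg.norm(l_eye - r_eye)          # c = tau / ||l - r||_2
    v = c * (r_outer - l_outer)
    theta = -np.arctan2(v[1], v[0])                  # rotation that levels the corners
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    pivot = c * l_eye                                # l_hat = c * l
    rot = lambda p: (c * p - pivot) @ R.T + pivot
    s_rot, lo, ro = rot(s), rot(l_outer), rot(r_outer)
    width = ro[0] - lo[0]
    center = (lo + ro) / 2.0
    return (s_rot - center) / width                  # corners land at (+-0.5, 0)
```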

In one embodiment, the DNN is similar to that of GoogLeNet, with the major difference that the herein-defined GL-loss is used. The input of the DNN can be a face image. The system changes the size of the filters in the first layer to adapt to the input, and randomly initializes the weights of these filters. In order to embed landmarks in the DNN, the number of filters in the first fully connected layer can be changed, as can the number of output filters, which is set to the number of AUs. A single DNN can be employed to detect all AUs in images of facial expressions.

The weights of the second part of the DNN can be optimized using backpropagation methods with the global loss defined above.

In some embodiments, the data augmentation can be performed by adding random noise to the 2D landmark points and applying the affine transformations described above.

In some embodiments, the system can be trained to initialize recognition of AUs in the wild using a training database as described above.

FIG. 7 illustrates a network system 700 for detecting AUs and emotion categories using deep neural networks (DNNs) in video and/or still images. The system 700 includes an image database component 710 having a set of videos and/or images. The system 700 includes a DNN 720 that determines AUs in the set of images of the image database 710. The DNN 720 includes a first part 730 that defines landmarks in the set of images as described above. The DNN 720 includes a second part 740 that determines AUs from the landmarks in the set of images in the database component 710 as described above. The system 700 includes a tagging component 750 that tags the images with at least one AU or no AU. The system 700 can store the tagged images in a processed image database 760.

Example Computing Device

FIG. 8 illustrates an exemplary computer that can be used for configuring hardware devices in an industrial automation system. In various aspects, the computer of FIG. 8 may comprise all or a portion of the development workspace 100, as described herein. As used herein, “computer” may include a plurality of computers. The computers may include one or more hardware components such as, for example, a processor 821, a random access memory (RAM) module 822, a read-only memory (ROM) module 823, a storage 824, a database 825, one or more input/output (I/O) devices 826, and an interface 827. Alternatively and/or additionally, controller 820 may include one or more software components such as, for example, a computer-readable medium including computer executable instructions for performing a method associated with the exemplary embodiments. It is contemplated that one or more of the hardware components listed above may be implemented using software. For example, storage 824 may include a software partition associated with one or more other hardware components. It is understood that the components listed above are exemplary only and not intended to be limiting.

Processor 821 may include one or more processors, each configured to execute instructions and process data to perform one or more functions associated with a computer for indexing images. Processor 821 may be communicatively coupled to RAM 822, ROM 823, storage 824, database 825, I/O devices 826, and interface 827. Processor 821 may be configured to execute sequences of computer program instructions to perform various processes. The computer program instructions may be loaded into RAM 822 for execution by processor 821. As used herein, “processor” refers to a physical hardware device that executes encoded instructions for performing functions on inputs and creating outputs.

RAM 822 and ROM 823 may each include one or more devices for storing information associated with operation of processor 821. For example, ROM 823 may include a memory device configured to access and store information associated with controller 820, including information for identifying, initializing, and monitoring the operation of one or more components and subsystems. RAM 822 may include a memory device for storing data associated with one or more operations of processor 821. For example, ROM 823 may load instructions into RAM 822 for execution by processor 821.

Storage 824 may include any type of mass storage device configured to store information that processor 821 may need to perform processes consistent with the disclosed embodiments. For example, storage 824 may include one or more magnetic and/or optical disk devices, such as hard drives, CD-ROMs, DVD-ROMs, or any other type of mass media device.

Database 825 may include one or more software and/or hardware components that cooperate to store, organize, sort, filter, and/or arrange data used by controller 820 and/or processor 821. For example, database 825 may store hardware and/or software configuration data associated with input-output hardware devices and controllers, as described herein. It is contemplated that database 825 may store additional and/or different information than that listed above.

I/O devices 826 may include one or more components configured to communicate information with a user associated with controller 820. For example, I/O devices may include a console with an integrated keyboard and mouse to allow a user to maintain a database of images, update associations, and access digital content. I/O devices 826 may also include a display including a graphical user interface (GUI) for outputting information on a monitor. I/O devices 826 may also include peripheral devices such as, for example, a printer for printing information associated with controller 820, a user-accessible disk drive (e.g., a USB port, a floppy, CD-ROM, or DVD-ROM drive, etc.) to allow a user to input data stored on a portable media device, a microphone, a speaker system, or any other suitable type of interface device.

Interface 827 may include one or more components configured to transmit and receive data via a communication network, such as the Internet, a local area network, a workstation peer-to-peer network, a direct link network, a wireless network, or any other suitable communication platform. For example, interface 827 may include one or more modulators, demodulators, multiplexers, de-multiplexers, network communication devices, wireless devices, antennas, modems, and any other type of device configured to enable data communication via a communication network.

While the methods and systems have been described in connection with preferred embodiments and specific examples, it is not intended that the scope be limited to the particular embodiments set forth, as the embodiments herein are intended in all respects to be illustrative rather than restrictive.

Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps, or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of steps or operational flow; plain meaning derived from grammatical organization or punctuation; and the number or type of embodiments described in the specification. Throughout this application, various publications may be referenced. The disclosures of these publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art to which the methods and systems pertain. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the scope or spirit. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims.

What is claimed is:
 1. A computer-implemented method for analyzing an image comprising: maintaining a plurality of kernel spaces of configural, shape and shading features, wherein each kernel space is non-linearly separable from each other kernel space, and wherein each kernel space is associated with one or more action units (AUs) and one or more AU intensity values; receiving, by a computing system, a plurality of images to be analyzed; and for each received image: determining face space data of configural (including shape) and shading features of a face in the image, wherein a face space includes a configural feature vector and a shading feature vector associated with shading changes in the face; and determining none, one or more AU values for the image by comparing the determined face space data of configural features to the plurality of kernel spaces to determine presence of the determined face space data of configural and shading features, wherein the image features comprise: landmark points derived using a deep neural network comprising a global-local (GL) loss function, and image features to identify AUs, AU intensities, emotion categories and their intensities derived using the deep neural network comprising a global-local (GL) loss function that is configured to backpropagate both local and global fit of landmark points projected over the image.
 2. The method of claim 1, comprising: processing, in real-time, a video stream comprising a plurality of images to determine AU values and AU intensity values for each of the plurality of images.
 3. The method of claim 1, wherein the determined face space data of the configural features comprises distance and angle values between normalized landmarks in Delaunay triangles formed from the image and angles defined by each of the Delaunay triangles corresponding to the normalized landmarks.
 4. The method of claim 1, wherein the shading feature vector associated with shading changes in the face is determined by: applying Gabor filters to normalized landmark points determined from the face.
 5. The method of claim 1, wherein the AU value and AU intensity value, collectively, define an emotion and an emotion intensity.
 6. The method of claim 1, wherein the image comprises a photograph.
 7. The method of claim 1, wherein the image comprises a frame of a video sequence.
 8. The method of claim 1, wherein the computing system uses a video sequence from a controlled environment or an uncontrolled environment.
 9. The method of claim 1, wherein the computing system uses black and white images or color images.
 10. The method of claim 1, comprising: receiving an image; and processing the received image to determine an AU value and an AU intensity value for a face in the received image.
 11. The method of claim 1, comprising: receiving a first plurality of images from a first database; receiving a second plurality of images from a second database; and processing the received first plurality and second plurality of images to determine for each image thereof an AU value and an AU intensity value for a face in each respective image, wherein the first plurality of images has a first captured configuration and the second plurality of images has a second captured configuration, wherein the first captured configuration is different from the second captured configuration.
 12. The method of claim 1, comprising: performing kernel subclass discriminant analysis (KSDA) on the face space; and recognizing AUs and AU intensities, emotion categories, and emotion intensities based on the KSDA.